This midterm exam (Rnotebook-midterm) is worth 150 points. For each question, review how much the item is worth in terms of points and plan your time wisely.
I would deem it “unwise” to spend hours on a question that is only worth 5 points.
This is an open-book examination. You can use your course notebooks (digital and old-school). You can use Internet resources (stackoverflow, Wikipedia, and Youtube).
You cannot use a living resource on the exam. That would include a classmate, another student, a sibling, a parent, a tutor, or online forums (where you post the question after the exam period has begun).
If you have questions that need clarifying, please email the instructor and he will try to answer them by email or ZOOM. He will be checking his email often during the week of the exam, to make himself available to you.
For every 10 points, this is the general breakdown.
| Emerging | Developing | Mastering |
|---|---|---|
| 0-4 | 5-7 | 8 - 10 |
Exploratory data analysis (EDA) is the process of analyzing data to summarize its main characteristics.
Confirmatory data analysis (CDA) is the process of applying specific statistical methods to analyze the data, with the same goal of summarizing its main characteristics. We commonly refer to CDA as “statistical hypothesis testing”. CDA generally makes assumptions about how the data is distributed, or applies a specific model to the data.
So the two approaches have the same objective: summarize the main characteristics of the data. How they achieve that objective is very different.
John Tukey is the father of “Exploratory Data Analysis” (EDA) https://en.wikipedia.org/wiki/John_Tukey. My favorite statistics book is his 1977 book (not surprising) entitled “Exploratory Data Analysis.”
From Wikipedia (Accessed October 2020): Tukey “also contributed to statistical practice and articulated the important distinction between exploratory data analysis and confirmatory data analysis, believing that much statistical methodology placed too great an emphasis on the latter.”
I belong to the “Tukey” camp. I believe too much emphasis is placed on “formal statistical methods and tests”, and that more emphasis should be placed on the underlying nature of the data. Those underlying principles are what the statistical methods were developed from.
As a data analyst, I believe that first and foremost, we should let the data speak. That is why the first half of the semester started in this form. The second half (confirmatory data analysis) will rely on what is labeled by many as “formal statistical methods”.
In 1965, Tukey divided his time between working at Princeton University and working at Bell Labs (a research think tank).
Tukey proposed that five summary statistics are essential to understanding numerical data: min, max, median (technically Q2), and the quartiles Q1 and Q3. In R, the function summary has only added the mean to Tukey’s proposal from years ago.
In 1975, Tukey invented the “box and whisker” plot that identifies the median, inter-quartile range (IQR), and outliers of data. The visualization displays the data without making any assumptions about its statistical distribution. The boxplot is a working demonstration of EDA. Let the data speak!
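Tukey’s five-number summary and his boxplot are both one line of base R. A small sketch on a made-up vector (the values are arbitrary, chosen to include an outlier):

```r
# Tukey's five-number summary on a small made-up vector
x = c(2, 4, 4, 5, 7, 9, 12, 15, 22, 60);

fivenum(x);   # min, lower hinge, median, upper hinge, max
summary(x);   # R's summary() adds the mean to Tukey's five numbers
boxplot(x);   # points beyond 1.5 * IQR from the hinges are drawn as outliers
```

Note that `boxplot` draws the value 60 as a lone point: the data speaks for itself, with no distributional assumption imposed.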
S and R
At the same time as John Tukey, three other men were also working at Bell Labs (John Chambers, Rick Becker, and Allan Wilks) on a statistical programming language, S, that emphasized EDA. This “statistical computing” language was programmed mostly in Fortran with some C. Chambers published his first “statistical computing” text in 1977, titled “Computational Methods for Data Analysis” https://archive.org/details/computationalmet0000cham/page/n11/mode/2up
Between 1988 and 1991, Chambers updated the engine of S to make it more robust; much of that base code was written by Chambers himself, and that same S codebase still powers much of R under the hood today.
R was an open-source offshoot (a “fork”) of S that occurred in the mid-1990s. Today, Chambers is still active in the S (now R) community. My favorite book of his (2008) is titled “Software for Data Analysis: Programming with R”. My second-favorite book of his (1998) is titled “Programming with Data: A Guide to the S Language”. In 2016, he authored another book that I still need to read, “Extending R.”
Modern R is written in Fortran, C, and C++.
Since its foundation is primarily C, we can use standard “make” and “make install” tools to compile R or its packages from the source code. That is why we needed Rtools on Windows. macOS is Unix-based, so the standard build toolchain is readily available and no separate Rtools-style bundle is required.
EDA as exploration is an iterative process.
I like to use the analogy of learning a foreign language using the “immersion” approach. For example, I studied Spanish in high school, learned vocabulary and grammar, and really could not speak the language well.
I did learn to speak the language well by being dropped into a foreign country for nearly two years. Some key ingredients to learn a language in an “immersive” environment are listed below:
surround yourself with others speaking the language to be learned [e.g., I did not spend a lot of time with other Americans speaking English].
be present when engaged in the language. Listen intently and try to understand as much as you can, not worrying about what you don’t fully understand.
reflect after language engagement. Try to synthesize what “gaps” you have and then develop study habits to fill in those gaps.
practice what you have learned.
To some degree, my success was likely accelerated because I had precursory training. Regardless, “immersive” practices benefit learning new languages.
[As part of your journey, I have asked you to keep a “paper-and-pencil” notebook to write down words/phrases/ideas. For example, in this section, there may be words/terms/phrases/ideas you don’t fully understand.]
Proficiency requires an iteration of these key features described above. But first, you have to understand what language you are trying to speak. Is it R? Is it Statistics? Mathematics? What exactly is the language?
In my opinion, the language is the “language of data”.
How do you think mathematics developed? It likely started with simple data, based on real-world experience: two hands, five fingers on each hand gives me the number ten. Counting in a base-10 system likely resulted. An entire domain of mathematics called “number theory” devotes its studies to these integer values.
How do you think statistics developed? People went out and started measuring things. One person would literally walk down the street in the late 1800s and ask if he could measure a person’s proportions. Another person would study crop yields at different locations and try to ascertain if they were different.
The foundation of mathematics and statistics is data. So I believe, we should let the data speak.
So I am definitely an EDA-guy. Some people are, some people are not. I personally am a strong believer that we should let the data speak, learn how to describe the data without imposing any restrictions on it, and always think about the data first and foremost.
I also believe that we should use logic, intuition, and insight before we develop any formal “confirmatory” hypothesis testing. I have intentionally architected this course to emphasize EDA.
I am also very adamant about data provenance, as I believe the “outputs” of any analysis (whether exploratory or confirmatory) are only as good as the data quality. I call this data intimacy. You should care just as much about the process of getting quality data as you do about analyzing said data.
The term GIGO (garbage-in, garbage-out) in my estimation represents what happens when care for quality data is treated lightly.
This full EDA approach is a multi-lens approach. View the data from as many different perspectives as possible before arriving at a conclusion. Base your conclusion on a synthesis of what you analyzed from those different perspectives.
[ I have expressed my opinion about the study of data and the importance of EDA in that study. What is your opinion on this topic?
This is worth 10 points, a minimal answer should be at least 3 paragraphs. Agreeing/Disagreeing with my opinion is not how you will be evaluated. How well you express YOUR opinion is what is important.]
The process of exploratory data analysis (EDA) really allows one to get to know the data that they are working with, which can provide more insights than may have been initially expected. The more familiar someone is with the data they are working with, the easier it will be for them to summarize and draw conclusions on that data. EDA is also important for identifying trends that may be present in the data. Exploratory data analysis may provide awareness as to which data should be kept and which can be discarded. It can help identify unique characteristics such as important (or not so important) variables and the relationships between each of the variables in the data. This is a crucial step in cleaning and organizing data to tailor it to the needs of the analysis, or research question. EDA is also an important tool in developing models from data. I think that the idea about “letting the data speak” as pointed out above is really interesting and something that I haven’t thought much about. Being able to explore the data without restricting it allows one to maximize what they learn from the data. This can result in the discovery of relationships and questions to later be explored.
[Make a set.seed choice so the code is replicable. Verify that every time you run the commands, the data is not changing with the seed “you chose”. Use rnorm and runif to simulate data, with n=9999; data points for each. Name x.rnorm the data for the first and x.runif the data for the second. Plot each with graphics::hist and report the summary statistics `base::summary` of each. Create plot(x.rnorm, x.runif);. Then create plot(x.rnorm, sample(x.rnorm) ); and compare it to plot(x.runif, sample(x.runif) );.]
set.seed(99);
n = 9999;
x.rnorm = rnorm(n, mean=0, sd=1);
x.runif = runif(n, min=0, max=1);
hist(x.rnorm);
hist(x.runif);
summary(x.rnorm);
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.514122 -0.663744 -0.021075 0.002495 0.683299 4.037498
summary(x.runif);
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000238 0.2551734 0.4975435 0.5006161 0.7535215 0.9998405
plot(x.rnorm, x.runif);
plot(x.rnorm, sample(x.rnorm));
plot(x.runif, sample(x.runif));
[Explain what rnorm and runif do. How are they similar? How are they different? What does the sample function do? How is plot(x.rnorm, x.runif); different from plot(x.rnorm, ( x.rnorm.sample = sample(x.rnorm) ) ); and plot(x.runif, ( x.runif.sample = sample(x.runif) ) );? How would you describe the shape of each of these plots?]

The function `rnorm` takes as parameters the number of observations to be generated, n, and the mean and standard deviation of the distribution to be simulated. This function results in a vector of n normally distributed random numbers. The function `runif` also takes as a parameter the number of observations to be generated, n, but rather than a mean and standard deviation, this function has the arguments min and max, which are the lower and upper bounds of the distribution. Similar to `rnorm`, `runif` results in a vector of random numbers, but these numbers follow a uniform distribution. Both of the above functions produce a vector of n random numbers and are useful for simulating data. These functions can also both be used for random number generation; it is just important to keep in mind that there will be a certain pattern to the generation. The most prominent difference between the two is the distribution that the random numbers follow, whether it be uniform or normal. The `sample` function results in a random sample of elements from the vector that it is passed as an argument. If the size argument of the function is not specified, as seen in the example above, the sample contains the exact same elements as the original vector, but they are likely in a different order. The size of the sample can be either smaller or larger than the original vector, but if it is larger, replacement is required in the sampling. By default, sampling using this function is performed without replacement.
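The claims about `sample` are easy to verify directly. A quick sketch (the seed 7 and the vector 1:5 are arbitrary choices for illustration):

```r
set.seed(7);
x = 1:5;

s = sample(x);                        # same elements, shuffled order
sort(s);                              # sorting recovers the original vector

sample(x, size = 3);                  # a subsample; without replacement by default
sample(x, size = 7, replace = TRUE);  # size > length(x) requires replace = TRUE
```

Calling `sample(x, size = 7)` without `replace = TRUE` throws an error, which confirms the "replacement is required" point above.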
The plot generated by `plot(x.rnorm, x.runif)` displays a plot in which random data points from a normal distribution are plotted against random data points from a uniform distribution. When looking at the data points with respect to each axis, it can be observed that the points are relatively evenly distributed between 0 and 1 on the y-axis (representing the uniform distribution), whereas on the x-axis (representing the normal distribution) the points are more densely located near 0, which is the mean of the distribution, and the further from 0 the points get, the more sparse they are. The plots generated by `plot(x.rnorm, sample(x.rnorm))` and `plot(x.runif, sample(x.runif))` both represent the same type of distributions (normal vs. normal, and uniform vs. uniform) plotted against each other. The shape of these plots varies from the plot described above. In the `x.runif` vs. `sample(x.runif)` plot, the points are pretty evenly distributed over the entire plot. Intuitively, this is because two uniform distributions are being plotted against each other, so the data points on both the x- and y-axes are expected to be equally proportioned. Likewise, in the `x.rnorm` vs. `sample(x.rnorm)` plot, the data points on both the x- and y-axes are more densely populated around the mean of 0. This results in a plot with points very concentrated (in a circular fashion) around the coordinates (0, 0), and the further from that point you travel, the fewer data points you are likely to see.
[Explore set.seed using rbinom. If you search in the BlackBoard discussion forum for easter you will see the discussion about August 25-27. Find a set.seed that would simulate flipping a coin 100 times and getting one result (heads/tails) exactly 52 times. Report any set.seed that achieves this objective. You can report more. Report length(x) where x is a vector of the values that meet the objective.]
n = 500000;
seeds = numeric(0);   # collects the seed values that meet the objective
for (i in 1:n)
{
set.seed(i);
result = rbinom(n = 1, 1000, 0.5);   # heads count from 1,000 fair coin flips
if (result == 555)
{
seeds = append(seeds, i);
}
}
seeds;
## [1] 49771 58927 63153 64693 74842 101575 103323 113838 121830 152054
## [11] 168654 178390 211784 225908 253759 298626 303566 316746 317968 325175
## [21] 400407 491595 492805 498906
length(seeds);
## [1] 24
cat('Length of the vector that contains seed values from 0 to', n, 'is:', length(seeds));
## Length of the vector that contains seed values from 0 to 500000 is: 24
cat('The', length(seeds), 'values between 0 and', n, 'for set.seed that achieve getting one result exactly 555 times in 1,000 coin tosses are:', paste(seeds, collapse=', '))
## The 24 values between 0 and 500000 for set.seed that achieve getting one result exactly 555 times in 1,000 coin tosses are: 49771, 58927, 63153, 64693, 74842, 101575, 103323, 113838, 121830, 152054, 168654, 178390, 211784, 225908, 253759, 298626, 303566, 316746, 317968, 325175, 400407, 491595, 492805, 498906
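As a sanity check on the brute-force loop, the binomial pmf gives the chance that any single seed qualifies, so we can estimate how many hits to expect among the 500,000 seeds searched above. A sketch:

```r
# Probability that one rbinom(1, 1000, 0.5) draw equals exactly 555 heads
p = dbinom(555, 1000, 0.5);
p;              # on the order of 6e-05

500000 * p;     # expected number of qualifying seeds; the loop above found 24
```

The expected count and the observed count of 24 are in the same ballpark, which is reassuring; consecutive seeds do not give strictly independent draws, so an exact match is not expected.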
[The dice have faces 1:10 … they are ten-sided dice (“decader” dice). Use nested for-loops to capture all possible outcomes of rolling the three dice at the same time. The result myrolls should have three columns: dice.1, dice.2, dice.3, plus a fourth column roll.total which is the sum dice.1 + dice.2 + dice.3 of one iteration of the nested for loop. Report the dim of myrolls. Build outcomes.table that summarizes the counts of myrolls$roll.total and convert it to a data frame outcomes.df. Name the columns: c(“roll.total”, “count”);. Report the sum of outcomes.df$count. Create a new column outcomes.df$prob (Probability) that determines the probability of that row given the total sum of the count column.]
## your code goes here ...
dice.1 = dice.2 = dice.3 = 1:10;
myrolls = NULL;
for(d1 in dice.1)
{
for(d2 in dice.2)
{
for(d3 in dice.3)
{
roll.total = d1 + d2 + d3;
row = c(d1, d2, d3, roll.total);
myrolls = rbind(myrolls, row);
}
}
}
myrolls = as.data.frame(myrolls);
colnames(myrolls) = c("dice.1", "dice.2", "dice.3", "roll.total");
myrolls;
print("Dimensions of myrolls");
## [1] "Dimensions of myrolls"
dim(myrolls);
## [1] 1000 4
outcomes.table = table(myrolls$roll.total);
outcomes.df = as.data.frame(outcomes.table);
colnames(outcomes.df) = c("roll.total", "count");
total.sum = sum(outcomes.df$count);
print("Sum of outcomes.df$count");
## [1] "Sum of outcomes.df$count"
total.sum;
## [1] 1000
outcomes.df$prob = outcomes.df$count / total.sum;
outcomes.df;
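The nested loops can be cross-checked with `expand.grid`, which enumerates all 10^3 outcomes in one call. A sketch (the names mirror the columns above):

```r
# Enumerate every roll of three ten-sided dice without explicit loops
rolls = expand.grid(dice.1 = 1:10, dice.2 = 1:10, dice.3 = 1:10);
rolls$roll.total = rolls$dice.1 + rolls$dice.2 + rolls$dice.3;

nrow(rolls);                          # 1000 outcomes, matching dim(myrolls)
tab = table(rolls$roll.total);
tab["23"];                            # 36 ways to roll a 23
as.numeric(tab["23"]) / nrow(rolls);  # probability 0.036
```

Agreement between the two enumerations is a good sign that the loop logic is right.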
# if you don't have the latest version of humanVerseWSU, you can access the function as follows:
# library(devtools);
# source_url(paste0( path.github, "humanVerseWSU/R/functions-dataframe.R" ));
sub.myrolls = subsetDataFrame(myrolls, "roll.total", "==", 23);
sub.myrolls;
sub.outcomes.df = subsetDataFrame(outcomes.df, "roll.total", "==", 23);
sub.outcomes.df;
When the three dice are thrown at the same time, a 23 can be rolled 36 different ways. The probability that a 23 will be rolled on a single throw is 0.036.
sub.outcomes.df = subsetDataFrame(outcomes.df, "roll.total", "==", 12);
sub.outcomes.df = rbind(sub.outcomes.df, subsetDataFrame(outcomes.df, "roll.total", "==", 22))
sub.outcomes.df;
sum(sub.outcomes.df$prob)
## [1] 0.1
The probability of rolling a 12 or a 22 in a single throw is 0.1.
sub.outcomes.df = subsetDataFrame(outcomes.df, "roll.total", "==", 26);
sub.outcomes.df = rbind(sub.outcomes.df, subsetDataFrame(outcomes.df, "roll.total", "==", 29))
sub.outcomes.df;
sum(sub.outcomes.df$prob)
## [1] 0.018
The probability of rolling a 26 or a 29 in a single throw is 0.018.
sub.outcomes.df = subsetDataFrame(outcomes.df, "roll.total", "==", 3);
sub.outcomes.df;
prob.twice.in.row = (sum(sub.outcomes.df$prob))^2;
prob.twice.in.row;
## [1] 0.000001
The probability that a 3 will be rolled on a single throw is 0.001. The probability that a 3 will be rolled twice in a row is 0.001*0.001, or 0.000001.
sub.outcomes.df = setNames(data.frame(matrix(ncol = 3, nrow = 0)), c("roll.total", "count", "prob"))
for (i in 3:12)
{
sub.outcomes.df = rbind(sub.outcomes.df, subsetDataFrame(outcomes.df, "roll.total", "==", i));
}
sub.outcomes.df;
sum(sub.outcomes.df$prob)
## [1] 0.22
The probability that a single throw will result in at most a 12 is 0.22.
We are going to use exploratory techniques to examine some Indeed.com data. If you recall, this job data examines how many jobs reference a certain keyword. Every Monday morning at 12:00:00AM EST (using a scheduler crontab), this data collection is performed. A few weeks ago, I added some new keywords. The data set we have consists of 5 weeks: 2020-38 to 2020-42.
jobs = utils::read.csv( paste0(path.mshaffer, "_data_/indeed-jobs.txt"), header=TRUE, quote="", sep="|");  # assign to `jobs` so the colnames call below works
colnames(jobs) = c("year.week", "search.query", "job.count");
jobs;
[Plot hist and boxplot and report summary(jobs$job.count).]
hist(jobs$job.count, breaks = 20, main='Histogram of Job Counts')
hist(jobs$job.count, breaks = 300, xlim=c(0,20000), main='Histogram of Job Counts - Zoomed in to show less than 20,000')
boxplot(jobs$job.count, main='Boxplot of Job Counts')
boxplot(jobs$job.count, ylim=c(0, 30000), main='Boxplot of Job Counts - Zoomed in to show less than 30,000')
summary(jobs$job.count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 149 1734 15170 11847 404527
From the histogram, it can be gathered that this data is not symmetric; rather, it is skewed to the right. A large portion of these data points fall below 20,000. This suggests that the distribution has outliers that are located to the right of the histogram (toward the upper bound of the data). When we zoom in on this histogram to show just the values that are less than 20,000, it is apparent that it is still a right-skewed distribution, with nearly 400 of the total data points falling between 0 and 1,000. Note that there are only 905 total job counts being graphed, so this is close to half of all of the data (approximately 400/905 = 0.4420). The boxplot further illustrates the skewness of the data, as well as the outliers. From the boxplot, we can also see that this distribution is positively skewed (data is more concentrated toward lower values). When looking at the zoomed-in boxplot, it can be seen that the first quartile and median are both relatively close to the minimum value, and the bottom whisker is much shorter than the top whisker, further supporting that the data is not symmetric. The top whisker in this boxplot looks to end between about 27,000 and 30,000, which tells us that any value over that number is considered an outlier. The median, which looks to be about 2,000, tells us that 50% of all data points fall below that number, and 50% fall above. This is consistent with the conclusions drawn from the histogram. Visually, the first quartile looks very close to the minimum value, suggesting that 25% of the data points are closely clustered around the minimum, and the third quartile, which looks to fall around 12,000, tells us that only 25% of the data points are above this number and 75% are below. The summary of the job count column represents numerically the inferences made from the plots. The summary tells us that the minimum value is 0 and the first quartile is 149.
To put into perspective how close these numbers are in the context of this data, the range between the minimum and maximum values is 404,527. This means that 25% of the total data falls within only 0.0368% of the total range between values. One piece of data that the summary gives us that the plots didn't necessarily provide is the mean of the data. This number is 15,170, whereas the median is only 1,734. This helps to illustrate the effect of the outliers on the mean of the data set. Because the mean is so much larger than the median, it is likely that the outliers are significantly larger than most of the data. In this case, the mean is even larger than the third quartile. The summary also provides us with the information that the maximum of the whole data set is 404,527. From the boxplot above, we see that this number is obviously an outlier, as the top whisker of our plot ended between 27,000 and 30,000. The main inferences that can be made from these three functions is that the data is skewed and there are significant outliers. This is important to keep in mind because these things can have an effect on our data analysis if we are not careful in the way that we work with them.
[What does the histogram tell you about the data? What does the boxplot tell you about the data? What does summary tell you about the data?]
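For heavy right-skew like this, viewing the data on a log scale is another useful EDA lens. A sketch on simulated lognormal data (the 905 matches the count mentioned above, but the values are made up, not the Indeed data):

```r
set.seed(99);
x = rlnorm(905, meanlog = 7, sdlog = 2);   # simulated right-skewed "counts"

hist(x, breaks = 20, main = "Raw scale: mass piled near zero, long right tail");
hist(log10(x + 1), breaks = 20, main = "log10 scale: roughly symmetric");  # +1 guards against log10(0)
```

The raw histogram hides most of the structure in one tall bar; the log-scale view lets the bulk of the data speak.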
deep.dive = c("Microsoft Office", "C++", "SQL", "Computer Science", "Python", "Java", "Statistics", "Data analysis", "Data analytics", "Javascript", "machine learning", "Git", "Tableau", "Business intelligence", "PHP", "Mysql", "MariaDB", "SAS", "SPSS", "Stata", "Data entry", "Big data", "Data science", "Power BI"); # I intentionally do not include "R" because it returns irrelevant results; I should have searched for "R programming" or "R statistics" ... which I am now doing...
or = "";
for(search in deep.dive)
{
or = paste0(or, " jobs$search.query == '",search,"' | ");
}
or = substr(or,0, strlen(or) - 2);
## TODO ... update subsetDataFrame to allow "OR" logic, currently only does "AND" ...
# jobs.subset = jobs[ or , ]; # doesn't work ...
jobs.subset = jobs[ jobs$search.query == 'Microsoft Office' | jobs$search.query == 'C++' | jobs$search.query == 'SQL' | jobs$search.query == 'Computer Science' | jobs$search.query == 'Python' | jobs$search.query == 'Java' | jobs$search.query == 'Statistics' | jobs$search.query == 'Data analysis' | jobs$search.query == 'Data analytics' | jobs$search.query == 'Javascript' | jobs$search.query == 'machine learning' | jobs$search.query == 'Git' | jobs$search.query == 'Tableau' | jobs$search.query == 'Business intelligence' | jobs$search.query == 'PHP' | jobs$search.query == 'Mysql' | jobs$search.query == 'MariaDB' | jobs$search.query == 'SAS' | jobs$search.query == 'SPSS' | jobs$search.query == 'Stata' | jobs$search.query == 'Data entry' | jobs$search.query == 'Big data' | jobs$search.query == 'Data science' | jobs$search.query == 'Power BI' , ];
# stem(jobs.subset$job.count);
# subsetDataFrame(jobs.subset, "job.count", "==", 0);
hist(jobs.subset$job.count, breaks = 20, main='Histogram of Job Counts Subset')
boxplot(jobs.subset$job.count, main='Boxplot of Job Counts Subset')
summary(jobs.subset$job.count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 11553 20746 44306 53896 238111
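As an aside, the long chain of `== ... |` tests in the subsetting step above can be collapsed into one membership test with `%in%`. A toy demonstration (the data frame `df` and vector `keep` are made-up stand-ins for `jobs` and `deep.dive`):

```r
# %in% collapses a chain of `|` equality tests into one membership test
df = data.frame(q = c("SQL", "R", "Python", "PHP"), n = c(10, 20, 30, 40));
keep = c("SQL", "Python");

sub1 = df[ df$q == "SQL" | df$q == "Python", ];  # explicit OR chain
sub2 = df[ df$q %in% keep, ];                    # same rows, one test
identical(sub1, sub2);                           # TRUE
```

With `%in%`, the full deep.dive subset becomes a one-liner instead of a 24-clause expression, and adding a keyword means editing only the vector.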
From the histogram, it looks as though the subset of the data is still skewed-right, but not quite to the degree that the whole data set is. It seems that there may be a few 'clusters' of outliers around about 160,000 and 230,000. This may be explained by the fact that the data was collected multiple times over the span of 5 weeks, so the clusters may represent the 5 counts for the same keyword over the time period. The mode in this histogram is the bin between 1,000 and 2,000, showing that this is the single range in which most data points fall relative to all other ranges of length 1,000. The boxplot created above is consistent with the conclusions drawn from the histogram, further supporting these claims. The data is skewed, but not as much as the original data set. When looking at the median, the values below (the first quartile and minimum) are closer together than the values above (the third quartile and maximum), suggesting that data is more concentrated on the lower side of the median. The boxplot contains the same 'clusters' of outliers as the histogram suggested. The top whisker of the boxplot looks to end around 120,000, therefore all values above this number are considered outliers of the data. Although it was not required for this question, I also ran a summary on the subset of job counts to illustrate numerically what is going on. This summary shows that subsetting the original data got rid of some of the larger outliers, as the original data had a maximum of 404,527, and the subset has a maximum of only 238,111. The mean of the subset is still significantly larger than the median of the subset, which can be expected due to the data being skewed right and outliers falling on the upper side of the data, but the difference between the two is not quite as drastic as the difference between the two in the original dataset. The spread of the data in the subset is not as large as in the original data.
[What does the histogram tell you about the data? What does the boxplot tell you about the data?]
jobs.subset$year.week = as.numeric( gsub("-",".",jobs.subset$year.week, fixed=TRUE) );
jobs.subset = sortDataFrameByNumericColumns(jobs.subset, c("year.week","job.count"), c("ASC","DESC") );
# easier to manage as "how many thousand jobs"
jobs.subset$job.count.k = jobs.subset$job.count / 1000;
do.nothing = plotJobs(jobs.subset);
do.nothing = plotJobs(jobs.subset, myy.lim = c(0,42) );
Looking at the data visualized in this way, my initial perspective is that the counts of many of the keywords look to be increasing slightly at a relatively constant rate. Another observation that I made is that the keywords with higher counts, such as Microsoft Office, C++, and Data Entry, seem to be more generalized skills, likely to be used in a variety of jobs, whereas the keywords with lower counts, such as Power BI, Stata, and MariaDB, are skills that are more specific to certain niches in the Data Analytics field. "Parallel-ish" lines on this plot signify that the rate at which each job count is changing is approximately proportional to the rates at which the job counts for other keywords are changing. From this chart, it does not seem that most of the job counts are changing very drastically, but because the job counts for each keyword do generally seem to be increasing, this may represent a growing job market and increased job demand in the field. These lines are following an increasing, or positive, trend. The line representing "Data analysis" seems to be following the trend of many of the other lines. Although the job counts are not increasing at a constant rate over the 5 weeks, they are generally increasing at some rate. One thing that I did notice is that between weeks 40 and 41 there seems to be a bit of a slump in the line where the count did not seem to increase as drastically as it did between other weeks. Compared to other Search-Query words, "Data analysis" seems to increase at a faster rate in some weeks and at a slower rate in other weeks, but overall it follows the same trend as the other words.
[What is your initial “perspective” of this data, now that you see it? Why are the lines “parallel-ish”? What kind of a trend is that?
Now comment on the line of data for “Data analysis”. How is it trending? How does it compare to other Search-Query Words?
What is your first perspective?
]
do.nothing = plotJobs(jobs.subset, myy.lim = c(0,20) );
[Look at the Git-40/Git-41 data history; notice the date-time and file sizes …]
It seems highly unlikely (almost impossible) that the count for the "Git" search-query dropped to 0 in week 40, especially considering that the weeks before and after this week have counts of 18,617 and 19,528, respectively. A more probable explanation for the sudden, drastic drop is missing data. There are multiple ways to deal with missing data, but in this case, because the trend line is increasing somewhat constantly, I think it would be appropriate to average the data for the two weeks before and after the missing data point and use that number for the missing data. We could also just fill that specific point with either the count that happened directly before or directly after the missing point. With this data set in particular, I noticed that there are counts for both "Git" and "git", and the counts for a given week are extremely similar. Because of this, it also seems like a valid solution would be to use the "git" count for week 40 to replace the missing data point.
[Is this data missing or did it just drop to zero that week? How does this relate to the other data (remember the idea of “continuity” in mathematics)? What should you do about it?]
idxs.week.40 = which(jobs.subset$year.week == 2020.40);
idxs.Git = which(jobs.subset$search.query == "Git");
# set notation
my.idx = intersect(idxs.Git,idxs.week.40);
jobs.subset[idxs.Git,];
jobs.subset[my.idx,];
## change this if you feel appropriate? To what number?
## I chose to change this to an average of the counts from weeks 38, 39, 41, and 42, which is 19090
jobs.subset[my.idx,3] = 19090; # job.count
jobs.subset[my.idx,4] = 19090/1000; # job.count.k (in thousands) ...
do.nothing = plotJobs(jobs.subset, myy.lim = c(0,20) );
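The “average the neighbors” repair is linear interpolation, which base R provides via `approx`. A sketch on a toy weekly series (only 18617 and 19528 come from the data discussed above; the other counts are made up for illustration):

```r
# Filling one missing weekly count by linear interpolation of its neighbors
counts = c(18105, 18617, NA, 19528, 19874);   # toy series; the NA is "week 40"
weeks = seq_along(counts);

ok = !is.na(counts);
filled = approx(x = weeks[ok], y = counts[ok], xout = weeks)$y;
filled[3];   # (18617 + 19528) / 2 = 19072.5
```

With a single interior gap, `approx` reduces to exactly the two-neighbor average; it generalizes cleanly if several consecutive weeks were ever missing.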
I use the term “bigger” or “better” intentionally. We are comparing two items, and these are generic ways of communicating such a comparison. In context of this data problem, a formalized form of the question would be something like: “Utilizing job count for a given search query, determine if the query ‘Microsoft Office’ has a larger job count than the query ‘C++’?” This question will need to be formalized if we are trying to draw specific conclusions, but the initiation of analysis “which is bigger” allows us to understand what the data says or does not say, through exploration.
To answer the question:
Mathematically, if two lines are parallel, and one is above the other, can we use distance to draw a conclusion? Now, many times in statistics we deal with noise in the data, it is not “deterministic” but “stochastic” … so we need to understand the variability. Based on the data we see, can we not use “parallel-line” logic to conclude that they are different? This is one dimension of EDA, use mathematics. (“mathematics”)
do.nothing = plotJobs(jobs.subset, myy.lim = c(80,250) );
boxplotJobQueryComparison(jobs.subset, "Microsoft Office", "C++");
summary(jobs.subset[jobs.subset$search.query == 'Microsoft Office', ]$job.count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 223663 228397 231979 230873 232215 238111
summary(jobs.subset[jobs.subset$search.query == 'C++', ]$job.count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 158317 161156 163044 163445 164920 169786
Tukey invented the boxplot as a nice EDA representation of the data. What logical inference can we make from the distances between the boxplots and the fact that no data is overlapping? This is another dimension of EDA: use “distance” and the boxplot “IQR” to compare two elements. What conclusion would we make? (“boxplot”)
We can conclude that the data are different based on both "mathematics" and the "boxplot". Above, I computed summary statistics on the "job.count" column for both the "Microsoft Office" and "C++" search-queries, and the results were consistent with the boxplots. Visually, the boxplots show that there is no overlap in the data, and the summary statistics confirm the same idea numerically. The minimum value for the "Microsoft Office" job count, 223,663, is higher than the maximum value for the "C++" job count, 169,786, further showing that there is no overlap between the data of the two search-queries. In this case, I do not believe that a formal inferential statistical test would tell us something different than logical inference. Although this is not always the case, the distance between the "Microsoft Office" data and the "C++" data is large enough that the difference seems intuitively clear. I think formal inferential statistical tests were derived from a combination of mathematics, boxplots, and other mathematical and statistical models.
[Can we conclude the data are different based on “mathematics” or “boxplot”?
Would a formal “inferential statistical test” tell us something different than logical inference?
How do you think “formal tests” were derived if not from “mathematics” and “boxplot” (EDA)?]
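The "no overlap" reasoning can be checked mechanically. A small sketch using stand-in vectors built from the five-number summaries reported above (these are not the full weekly data, just the summary values):

```r
# stand-in vectors shaped like the two summaries above
msoffice = c(223663, 228397, 231979, 232215, 238111);  # Min, Q1, Median, Q3, Max
cpp      = c(158317, 161156, 163044, 164920, 169786);

# the ranges do not overlap if one maximum sits below the other minimum
no.overlap = ( max(cpp) < min(msoffice) );
no.overlap;   # TRUE for this data
```

If the maximum of one group falls below the minimum of the other, no formal test is needed to see that the groups differ.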
# courage in trusting your intuition may require a fall-back ... for those that need it ...
t.test.jobs(jobs.subset, "Microsoft Office", "C++");
##
## Welch Two Sample t-test
##
## data: x and y
## t = 22.016, df = 7.6613, p-value = 0.00000003332
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 60.31098 74.54582
## sample estimates:
## mean of x mean of y
## 230.8730 163.4446
do.nothing = plotJobs(jobs.subset, myy.lim = c(49,53) );
do.nothing = boxplotJobQueryComparison(jobs.subset, "Statistics", "Java");
summary(jobs.subset[jobs.subset$search.query == 'Statistics', ]$job.count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49383 50588 51850 51399 52159 53016
summary(jobs.subset[jobs.subset$search.query == 'Java', ]$job.count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49993 51155 52113 51820 52872 52969
t.test.jobs(jobs.subset, "Statistics", "Java")
##
## Welch Two Sample t-test
##
## data: x and y
## t = -0.49621, df = 7.8738, p-value = 0.6333
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.384104 1.541704
## sample estimates:
## mean of x mean of y
## 51.3992 51.8204
For this data, I can descriptively report that the third quartile (Q3) of “Statistics” is about equal to the median of “Java”. The inter-quartile ranges (IQRs) of the two overlap. The minimum value of “Java” is larger than the minimum value of “Statistics”. The maximum value of “Java” is slightly smaller than the maximum value of “Statistics”.
The interpretation of the question "Is 'Statistics' bigger than 'Java'?" may vary depending on who is answering it, but I have chosen to interpret it as: "Are the job counts for 'Statistics' significantly larger than those of 'Java'?" Based on "mathematics", "boxplot", and "ttest", it is not reasonable to conclude that there is a significant difference between the job counts of "Statistics" and "Java"; therefore, we cannot say that "Statistics" is bigger than "Java". As stated above, there are points where the summary statistics of "Statistics" exceed those of "Java", but the reverse is also true. Visually, the boxplot shows much less difference between the two keywords than we saw in the example above. In fact, the data for "Java" seem to lie inside the range of the data for "Statistics". Furthermore, looking at the t-test for the job counts of these two search-queries, we see that the p-value for this specific test is 0.6333. Thus, we fail to reject the null hypothesis (that the data for the two keywords are not significantly different), and cannot conclude a significant difference between the two data sets.
[Use “mathematics” and “boxplot” and “ttest” to answer the question: Is “Statistics” bigger than “Java”?]
do.nothing = plotJobs(jobs.subset, myy.lim = c(14,17) );
boxplotJobQueryComparison(jobs.subset, "Data science", "Big data");
summary(jobs.subset[jobs.subset$search.query == 'Data science', ]$job.count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15044 15587 15998 15908 16247 16665
summary(jobs.subset[jobs.subset$search.query == 'Big data', ]$job.count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15737 15997 16463 16267 16518 16619
t.test.jobs(jobs.subset, "Data science", "Big data")
##
## Welch Two Sample t-test
##
## data: x and y
## t = -1.1002, df = 6.6285, p-value = 0.3096
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.1381677 0.4209677
## sample estimates:
## mean of x mean of y
## 15.9082 16.2668
As in the previous example, it does not seem plausible to conclude that there is a significant difference between the data for these two keywords based on the boxplot and summary statistics. When the t-test is performed with a null hypothesis that the data for the two keywords are not significantly different, it results in a p-value of 0.3096. In this case, we also fail to reject the null hypothesis, so we cannot conclude that the data are statistically different. The overlap between the keywords "Data science" and "Big data" (and many of the other keywords as well) is something that deserves attention when dealing with this data. Many of the keywords for which data is being collected are closely related, and may even be used interchangeably in some instances. For "Data science" and "Big data" in particular, I would imagine that many of the job postings containing the keyword "Big data" would also contain "Data science", as big data is a subset of the data science field. The other comparisons that were performed may be "very likely okay" because the keywords being compared are less likely to overlap: "Microsoft Office" and "C++", and "Statistics" and "Java", are not as closely related to each other as "Data science" and "Big data" are. An improved approach to collecting the data, one that would allow more accurate comparison of these two keywords, would be to also collect the number of job postings that include both keywords. That way, we would see not only how many job postings contain both keywords, but also how many contain only one or the other.
[Use “mathematics” and “boxplot” and “ttest” to make a conclusion.
Next, think carefully about the nature of the data. Is the “collection-approach” flawed to make a conclusion comparing job-counts of these specific keywords? How likely is it that a single-job posting may have both keywords?
This is an example where “data-integrity” knowledge would surpass the other three logical conclusions.
This intuition requires an understanding of what mathematicians call “set theory”. If I am doing an independent search on keywords, is it possible that one job would show up in multiple searches. That is, being counted twice or more.
Intuition and logic would also allow us to conclude that our other comparisons are “very likely okay”. Why?
What would be an improved approach to collecting the data that would allow me to more accurately compare these two keywords?
This is representative of why exploratory data analysis is essential. It provides us insight into the domain and highlights the need for better data, if we can find it.
]
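The double-counting concern from "set theory" can be made concrete with R's set operations. A toy sketch with made-up posting IDs (the real analysis would need per-posting IDs from the collection step, which this dataset does not have):

```r
# made-up job-posting IDs returned by two independent keyword searches
hits.datascience = c("job01", "job02", "job03", "job04");
hits.bigdata     = c("job03", "job04", "job05");

both   = intersect(hits.datascience, hits.bigdata);  # postings counted twice
either = union(hits.datascience, hits.bigdata);      # true number of distinct postings

naive.total = length(hits.datascience) + length(hits.bigdata);  # 7
length(either);                                                 # 5 distinct postings
length(both);                                                   # 2 postings inflating the naive total
```

Summing the two independent search counts overstates the number of distinct postings by exactly the size of the intersection, which is why collecting a "both keywords" count would improve the comparison.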
Distance is a fundamental unit of comparison. We can use our “mathematical” understanding of distance. We can use an “EDA” understanding of the data (e.g., the boxplot). We need to understand the data sourcing and how that will relate to the logical conclusions we are trying to draw.
When we transition to “confirmatory inferential statistics”, we cannot leave our understanding of “maths” and “EDA” behind. They are the foundation from which “inferential statistics” is built. They are “logical inference”.
If you recall, we had a notebook on collecting data from Wikipedia. We documented the “data-provenance” protocols to make this happen. We have documented and can replicate our data-collection strategies.
Imagine you are preparing for a job interview. Write a 90-second blurb describing “what is data provenance” and “why it matters”. I would suggest the STAR(S) approach mentioned in one of the notebooks. Reference the “Wikipedia” project as an example of how one can implement the features.
Data provenance is the management of data from its raw, original form to whatever present form the data takes on. It is not only the collection of the data but also the documentation of how that data was organized, cleaned, and prepared in order to answer whatever research questions are posed. One goal of data provenance is to get to a place where you can start asking questions, which are often prompted by the data itself. This is an appropriate time to map the research data to other data and formulate questions that may involve both sources. One importance of data provenance stems from the ability to track and recreate the entire process regarding the data (from collection to analysis). This makes it easier to find potential errors, and it allows others to recreate the analysis being performed. Overall, data provenance provides the who, what, when, where, why, and how of the production of the data in question, and verifies the authenticity and reliability of the data for those who may not have been involved in collecting, organizing, and/or analyzing it.
[What is data provenance?
Probably about 200-250 words (with the 90 second limit) ]
“Geo-spatial” studies are becoming much more common in the “data analytics” community, so let’s use basic “latitude/longitude” data to formally talk about “distance.”
So we will look at the 50 state capitals of America (USA). Before that, let’s examine some basic principles of distances using my hometown.
cfalls (use lat/long provided from the Wikipedia lookup) … build the bounding “box” and perform the post-hoc “radial distance” computations (as we did in the homework).
# copy/paste __student_access__/_SECRET_/_SECRET_database_.txt into console ... or this won't work
cfalls.latitude = 48.37028;
cfalls.longitude = -114.18889;
my.radius = 22; my.units = "mi"; #miles
# THIS is where these exam functions live ...
# source_url( paste0(path.github,"misc/functions-midterm-F2000.R") ); # should be 2020 ... oh well
cfalls.info = getNeighborsFromLatLong(22, 48.37028, -114.18889, "mi");
## [1] "The QUERY returned ... 13 ... NEIGHBORS"
cfalls.info$neighbors;
############## plotting ##############
brown = "#ffe4c4";
green = "#014421";
my.state = "montana";
my.state.color = "#ffe4c4";
my.county = "flathead";
my.county.color = "#014421";
my.nearby.states = c("idaho", "washington", "oregon");
plotNeighbors(cfalls.info,
state = my.state,
state.color = my.state.color,
state.border = 0.05,
county = my.county,
county.border = 0.05, # if you don't see the box, increase this to like 0.75
county.color = my.county.color,
nearby.states = my.nearby.states);
In the `buildBoundingBoxFromRadiusAndGivenLatitudeLongitude` function, the bounding box is determined by finding delta.latitude and delta.longitude and then adding those values in each direction (north, south, east, and west) from our given latitude and longitude. The values delta.latitude and delta.longitude are found by taking the ratio between the given radius and a set factor (factor.lat or factor.long), which differs depending on whether we are dealing with latitude or longitude. When examining these variables, I noticed that factor.long is larger than factor.lat; thus, because we are dividing the radius by these factors, delta.longitude ends up being a smaller number. (Note: when finding delta.long, we actually divide the radius by factor.long multiplied by the cosine of delta.latitude in radians.) When we add delta.long and delta.lat in each direction from the given latitude and longitude, the resulting rectangle ends up skinnier than it is tall because delta.long is smaller than delta.lat. Overall, this visualization does a good job portraying the information in our data frame. I appreciate that the visual shows where the bounds for the 'neighbors' are, rather than just showing the neighbors themselves. One change I would make to this image in particular is to add more contrast between the color of the state and the bounding box. I also think it would be helpful if, instead of a rectangle, the bound were shown as a circle. That would make more sense to me, since we are looking for neighboring cities within a certain radius, implying that the bound is circular.
[ - why is the box not a square, but a rectangle? … see factor.lat and factor.long in function buildBoundingBoxFromRadiusAndGivenLatitudeLongitude - critique the visualization … what do you like? what would make it better?]
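A generic version of the delta computation might look like the following. The factor values and the cosine correction are assumptions about how `buildBoundingBoxFromRadiusAndGivenLatitudeLongitude` could be implemented (see the sourced function for the real definitions), not a copy of it:

```r
# rough miles per degree of latitude; a degree of longitude shrinks with cos(latitude)
factor.lat  = 69;       # assumed value
factor.long = 69.17;    # assumed value

my.latitude = 48.37028; # Columbia Falls
my.radius   = 22;       # miles

delta.lat  = my.radius / factor.lat;
delta.long = my.radius / ( factor.long * cos( my.latitude * pi/180 ) );

# box corners: (latitude +/- delta.lat, longitude +/- delta.long)
c(delta.lat, delta.long);
```

Because the two deltas come from different divisors, the box is a rectangle in degree-space rather than a square.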
Instead of cfalls.info, you do hometown.info
sandpoint.latitude = 48.266667;
sandpoint.longitude = -116.566667;
my.radius = 30; my.units = "mi"; #miles
sandpoint.info = getNeighborsFromLatLong(my.radius, sandpoint.latitude, sandpoint.longitude, my.units);
## [1] "The QUERY returned ... 22 ... NEIGHBORS"
sandpoint.info$neighbors;
my.state = "idaho"; my.state.color = "#9fa0a1";
my.county = "bonner"; my.county.color = "#a11818";
my.nearby.states = c("oregon", "washington", "montana", "wyoming");
plotNeighbors(sandpoint.info,
state = my.state,
state.color = my.state.color,
state.border = 0.05,
county = my.county,
county.border = 0.10,
county.color = my.county.color,
nearby.states = my.nearby.states);
From Wikipedia, we grabbed one page that lists the 50 U.S. cities designated as the capitals of the individual states of the United States of America.
Using the power of a for-loop we make our functions work for us. We now have the data ready to go.
capitals = utils::read.csv( paste0(path.mshaffer, "_data_/state-capitals/final/state-capitals.txt"),
skip=1,
header=FALSE, quote="", sep="|");
colnames(capitals) = c("state", "capital", "latitude", "longitude", "capital.since", "area.sq.miles", "population.2019.est", "population.2019.est.MSA", "population.2019.est.CSA", "city.rank.in.state", "url");
# hack-add from https://en.wikipedia.org/wiki/ISO_3166-2:US
# TODO, grab this table "appropriately" as a new function
# Is there a dictionary for shortened city names?
# the long-field names is also an issue that needs to be improved upon in the next iteration.
capitals$st = c("AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"); # ,"DC","AS","GU","MP","PR","UM","VI");
myLabels = paste0(capitals$capital, ", ", capitals$st);
capitals;
usmap (ggplot2)
latlong = removeAllColumnsBut(capitals, c( "state", "st", "capital", "latitude", "longitude", "population.2019.est") );
# first two elements have to be this
latlong = moveColumnsInDataFrame(latlong, c("longitude","latitude"), "before", "state");
# for transform to work
library(usmap);
latlong.transform = usmap_transform(latlong);
library(ggplot2);
### plot_usmap ...
plot_usmap(fill = "#53565A", alpha = 0.25) +
ggrepel::geom_label_repel(data = latlong.transform,
aes(x = longitude.1, y = latitude.1, label = capital),
size = 3, alpha = 0.8,
label.r = unit(0.5, "lines"), label.size = 0.5,
segment.color = "#981E32", segment.size = 1,
seed = 1002) +
scale_size_continuous(range = c(1, 16),
label = scales::comma) +
labs(title = "U.S. State Capitals",
subtitle = "Source: Wikipedia (October 2020)") +
theme(legend.position = "right")
voronoi (tripack)
colors = rainbow(50, s = 0.6, v = 0.75);
## initial visualization ...
library(tripack);
# plot( voronoi.mosaic(latlong[,4:3], duplicate="remove"), col=colors, xlab="");
plot( voronoi.mosaic(x = latlong$longitude, y = latlong$latitude), col=colors, xlab="");
text(x = latlong$longitude, y = latlong$latitude, labels = latlong$capital, col=colors, cex=0.5);
map (base)
### how is any of the other visualizations really any better than a simple map ... with actual locations for Alaska/Hawaii?
library(maps);
map('state', plot = TRUE, fill = FALSE,
col = "blue", myborder = 0.5
);
points(x = latlong$longitude, y = latlong$latitude,
col = "red", pch = "*", cex = 1);
Above, the data was displayed using three different visualization packages.
The first, plot_usmap, uses the `ggplot2` methodology, which is tied to the “tidyverse” landscape of the R community.
The second, voronoi.mosaic, uses the graph-theory topology known as Voronoi partitioning https://en.wikipedia.org/wiki/Voronoi_diagram with the “base” plot function to visualize the topology.
The last, map, is from the “base” environment.
Visually, the plot most appealing to me is the map created using ggplot2 and plot_usmap: I like how simple the colors are and that it includes the outline for every state (unlike the other two plots). It is also useful that the names of the state capitals appear on this visual. Functionally, which map presents the data most effectively depends on what information is being portrayed. For instance, the map from the base environment is useful for presenting where each capital is located with respect to its individual state. One thing I did not like about this plot, though, is that although the asterisks representing the capitals of Alaska and Hawaii sit in the correct place relative to the contiguous United States, they float in the middle of blank space rather than being outlined by the shape of the appropriate state. The image created with plot_usmap presents the data most effectively if you want to show the audience not only which states the capitals are located in, but also the names of the capitals. One downfall of this map is that the capital locations are not geographically accurate with respect to the state itself. This image also places Alaska and Hawaii next to California. One key factor in determining the appropriateness of that placement is the information to be portrayed. I think it is appropriate here, where we are more interested in the capitals of each state than in the geography of the states themselves. It may not be appropriate when the visualization represents distances between the states, because then accurate geography is important to correctly represent what is going on.
[ Visually, which is the most appealing to you? Why?
Functionally, which presents the data most effectively? Why?
When we create visualizations, it is essential to portray the data accurately. For example, there are times when putting Alaska/Hawaii next to California might be appropriate, and other times it might not be.
What is a one key factor that would determine this appropriateness?
]
The dataframe we are using has been named latlong to represent the latitudes and longitudes of the 50 U.S. state capitals.
# manual conversion
# how many miles is 1 degree of latitude
latitude.factor = 69; # rough mile estimate # 68.703 ?
latlong$x.lat = latlong$latitude * latitude.factor;
longitude.factor = 54.6; # rough mile estimate
latlong$y.long = latlong$longitude * longitude.factor;
latlong = moveColumnsInDataFrame(latlong, c("y.long","x.lat"), "before", "longitude");
Let’s start with geo-spatial distances. I will do distMeeus and you will do distHaversine
These distance formulas can utilize the true geo-spatial coordinates. The distance table is getting large, so there is a helper function to lookup a certain value.
library(geosphere);
library(measurements);
dist.meeus = conv_unit( distm( latlong[,3:4], fun=distMeeus), "m", "mi"); # default meters to miles
dist.meeus.m = as.matrix( dist.meeus );
rownames(dist.meeus.m) = myLabels;
colnames(dist.meeus.m) = myLabels;
dist.meeus.df = as.data.frame( round( dist.meeus.m, digits=1) );
dist.meeus.df; ## too big
lookupPairwiseValue(dist.meeus.df, "Juneau, AK", "Montgomery, AL");
## [1] 2855.6
You can compare the two with the code below (currently in comments).
dist.haversine = conv_unit( distm( latlong[,3:4], fun=distHaversine), "m", "mi"); # default meters to miles
dist.haversine.m = as.matrix( dist.haversine );
rownames(dist.haversine.m) = myLabels;
colnames(dist.haversine.m) = myLabels;
dist.haversine.df = as.data.frame( round( dist.haversine.m, digits=1) );
dist.haversine.df; ## too big
lookupPairwiseValue(dist.haversine.df, "Juneau, AK", "Montgomery, AL");
## [1] 2854.6
# comparison code below
x = dist.meeus.df[,2]; # Juneau ... pick one location
y = dist.haversine.df[,2];
my.lim = c(0, 4000 );
plot(x, y,
xlab="Meeus",
ylab="Haversine",
main="dist( Juneau, AK )",
xlim=my.lim,
ylim=my.lim);
plotXYwithBoxPlots(x, y,
xlab="Meeus",
ylab="Haversine",
main="dist( Juneau, AK )",
xlim=my.lim,
ylim=my.lim);
When examining the above data, there does not seem to be much difference between these two calculations, because the points fall almost perfectly in a line. The boxplots for each of these distances are also symmetric with respect to each other. Both of these distance methods find the shortest distance between two points. The conceptual difference is that the Haversine method assumes a spherical earth and ignores ellipsoidal effects, whereas the Meeus method assumes an ellipsoid and finds the distance between the two points on that ellipsoid. The results of the two methods are very similar, but a method that provides a more accurate distance algorithm for geo-spatial calculations is distGeo, which calculates the shortest path, called the geodesic, between two points on an ellipsoid.
[ Examining the data above, is there much difference between these two calculations?
What is the conceptual difference between these two calculations? Try ?distMeeus or ?distHaversine
Are the results similar?
Is there a more accurate distance algorithm for geo-spatial calculations? If so, what is it?
]
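The spherical assumption behind `distHaversine` can be written out directly. A self-contained sketch of the haversine great-circle formula (the Earth radius is the conventional mean value in miles; the Juneau/Montgomery coordinates are looked-up values and may differ slightly from the capitals dataset):

```r
# great-circle distance on a sphere via the haversine formula
haversine.miles = function(lat1, lon1, lat2, lon2, R = 3958.8) {
  to.rad = pi/180;
  dlat = (lat2 - lat1) * to.rad;
  dlon = (lon2 - lon1) * to.rad;
  a = sin(dlat/2)^2 + cos(lat1*to.rad) * cos(lat2*to.rad) * sin(dlon/2)^2;
  2 * R * asin( sqrt(a) );
}

# Juneau, AK to Montgomery, AL ... lands within a few miles of the 2854.6 table value
haversine.miles(58.3019, -134.4197, 32.3668, -86.3000);
```

The small residual difference from the table reflects coordinate rounding and the sphere-vs-ellipsoid assumption discussed above.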
Let’s now do manhattan and euclidean which are more common in the “statistical clustering” domain. I will do euclidean.
dist.manhattan = dist( latlong[,1:2],
method="manhattan", diag=TRUE, upper=TRUE);
dist.manhattan.m = as.matrix( dist.manhattan );
rownames(dist.manhattan.m) = myLabels;
colnames(dist.manhattan.m) = myLabels;
dist.manhattan.df = as.data.frame( round( dist.manhattan.m, digits=1) );
dist.manhattan.df; ## too big
lookupPairwiseValue(dist.manhattan.df, "Juneau, AK", "Montgomery, AL");
## [1] 4418
We have converted the true latitude/longitude to a miles-type format, so the resulting table will report miles.
dist.euclidean = dist( latlong[,1:2],
method="euclidean", diag=TRUE, upper=TRUE);
dist.euclidean.m = as.matrix( dist.euclidean );
rownames(dist.euclidean.m) =
colnames(dist.euclidean.m) = myLabels;
dist.euclidean.df = as.data.frame( round( dist.euclidean.m, digits=1) );
dist.euclidean.df; ## too big
lookupPairwiseValue(dist.euclidean.df, "Juneau, AK", "Montgomery, AL");
## [1] 3179.8
You can compare the two with the code below (currently in comments).
x = dist.manhattan.df[,2]; # Juneau ... pick one location
y = dist.euclidean.df[,2];
my.lim = c(0, 4000 );
plot(x, y,
xlab="Manhattan",
ylab="Euclidean",
main="dist( Juneau, AK )",
xlim=my.lim,
ylim=my.lim);
plotXYwithBoxPlots(x, y,
xlab="Manhattan",
ylab="Euclidean",
main="dist( Juneau, AK )",
xlim=my.lim,
ylim=my.lim);
The main difference between these two calculations is that Euclidean distance is the shortest straight-line distance between two points ("as the crow flies"), whereas Manhattan distance assumes you are constrained to city blocks. When using Manhattan distance, you cannot cut along the diagonals because there would be buildings in the way; instead, you must stick to the "grid" of the city streets.
Although the results for the two calculations follow a similar pattern, I would not necessarily say the calculations themselves are similar, because the points do not lie on a completely straight line when plotted against each other. It is clear that they are not as similar as the distances compared above.
[ What is the difference between these two calculations?
Are the results similar?
]
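The relationship between the two metrics holds for any pair of points: Manhattan distance is never smaller than Euclidean distance. A quick sketch with `dist()` on made-up coordinates (a classic 3-4-5 right triangle, so both values are exact):

```r
# two points in a mile-scaled (x, y) plane
pts = rbind( c(0, 0), c(3, 4) );

d.euclidean = as.numeric( dist(pts, method = "euclidean") );  # sqrt(3^2 + 4^2) = 5
d.manhattan = as.numeric( dist(pts, method = "manhattan") );  # |3| + |4| = 7

d.manhattan >= d.euclidean;   # TRUE for any pair of points
```

This is why the Manhattan/Euclidean scatter above bows away from the diagonal rather than lying on a straight line.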
Aggregating or agglomerating data is typically called “clustering” and is generally considered to be an “unsupervised learning” method.
We use hclust to perform Hierarchical Clustering. The word “hierarchical” is used because the nature of the data is organized like a family genealogy. The bottom of the tree represents the descendants that eventually “link” to a common ancestor.
If you type ?hclust you can review the parameter options which we explored in a notebook. hclust is primarily a function of dist() [distance], and we have just computed some distances, so we could try and apply hclust to our data to see if our state capitals will cluster into meaningful regions.
There are several agglomeration methods (“linkage”) one could choose from. Each method takes the dist https://en.wikipedia.org/wiki/Hierarchical_clustering#Metric and performs some pairwise distance algorithm called a linkage criteria https://en.wikipedia.org/wiki/Hierarchical_clustering#Linkage_criteria.
I generally use either the “complete” linkage method or the “ward.D2” linkage method. You can read help ?hclust to better appreciate why: “A number of different clustering methods are provided. Ward’s minimum variance method aims at finding compact, spherical clusters. The complete linkage method finds similar clusters.”
If you are trying to link binary data (zeroes and ones) or genomic data (where the distances were computed using a genetic-distance algorithm), you may want to try the UPGMA approach: “average” or “centroid”.
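The linkage choices can be compared on a toy distance matrix. A minimal sketch with synthetic one-dimensional points (two obvious groups, so the merge order is easy to follow; not the capitals data):

```r
# six synthetic points forming two clear groups
x = c(1, 2, 3, 10, 11, 12);
d = dist(x);   # euclidean by default

hc.complete = hclust(d, method = "complete");
hc.ward2    = hclust(d, method = "ward.D2");

# on well-separated data, both linkages recover the same two clusters
cutree(hc.complete, k = 2);
cutree(hc.ward2,    k = 2);
```

On messier, higher-dimensional data the linkages can disagree, which is why trying both “complete” and “ward.D2” (as below) is worthwhile.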
Since I am from a large family, I will make a family tree analogy. For me, I am the 5th of 11 siblings (5 brothers, 5 sisters). If I wanted to cluster the siblings in a pair-wise fashion, how would I begin?
First, I would ask, which other sibling is most like me? Since this is a pair-wise approach, and there are 11 siblings, maybe at the initial stage, I will not be paired with another sibling.
In fact, at least one will not be paired because the total number is 11. And maybe more than one will not be paired in the initial stage. Remember, “similarity” is being defined based on some distance-linkage method. And to use this approach, “similarity” needs to be numerical data.
At each stage, pair-wise joining occurs until there is nothing left to join. The tree contains all of the elements. Every element (branch) eventually joins the main branch (trunk) of the tree.
We already have some data for the U.S. capitals and have computed a Euclidean distance using a capital-city’s longitude and latitude. Let’s choose to cut the result into 12 agglomerations. Why 12? It was a choice based on my life experience and intuition. As I reflected on why, I did some external searching that validates a choice in that range https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States. You certainly could run this analysis with another choice. It is exploratory, and your intuition matters.
Geographically, I am saying the capital-city does represent the state. An ideal representation may be in the center (centroid) of the state. Remember this when we see the linkages with “New York”: the city is Albany, not Manhattan.
## ward.D2
hclust.ward2.dist.euclidean = hclust(dist.euclidean, method="ward.D2");
plot( hclust.ward2.dist.euclidean,
labels= myLabels );
rect.hclust( hclust.ward2.dist.euclidean, k=12 );
## complete
#
# hclust.complete.dist.euclidean = hclust(dist.euclidean, method="complete");
#
# plot( hclust.ward2.dist.euclidean,
# labels= myLabels );
# rect.hclust( hclust.ward2.dist.euclidean, k=12 );
#
cutree
In the above example, the tree was cut into 12 groups based on the “distance” formula used and the “agglomeration” linkage technique invoked. I likely should have used a “geo-spatial” distance, but we will see that mere “euclidean” distance seems to perform okay.
A fancy word for this statistical tree is a “dendrogram”. I call each element of the tree a branch. The smallest branches (twigs) are the fundamental elements, in this case the cities. Over time, they merge with other small branches, and so on.
The relative height when this smallest branch merges with another branch demonstrates when the branch has found a similar pair-wise match (with another smallest branch or a merging branch). I call “Honolulu Hawaii” an isolate because the vertical height when it merges with another branch is the highest of all of the smallest branches (e.g., cities). “Juneau Alaska” is another isolate, but it does merge before “Hawaii”.
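The “isolate” idea, the height at which a leaf first merges into the tree, can be computed directly from an hclust object. A generic sketch on synthetic one-dimensional data (not the capitals; in `hclust`’s merge matrix, negative entries denote original leaves):

```r
# synthetic points; the value 25 plays the role of an isolate like Honolulu
x  = c(1, 2, 3, 25);
hc = hclust( dist(x), method = "ward.D2" );

# for each leaf, find the height of the first merge it participates in
first.merge.height = sapply( seq_along(x), function(leaf) {
  row = which( apply(hc$merge, 1, function(m) any(m == -leaf)) )[1];
  hc$height[row];
});

which.max(first.merge.height);   # the leaf that merges highest: the isolate (point 25)
```

The same computation on the capitals tree would flag Honolulu, and then Juneau, as the highest-merging leaves.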
It would be nice if we could decompose this information and look at one cutree at a time. And color-code the distinctions. Using the function plot.hclust.sub we can do this.
source_url( paste0(path.github,"humanVerseWSU/R/functions-EDA.R") ); # EDA functions ...
## SHA-1 hash of file is dd72e464241952b23d5b08d2213099eccf7a86f6
hclust.ward2.dist.euclidean$labels = myLabels;
plot.hclust.sub(hclust.ward2.dist.euclidean, k=12);
## Registered S3 method overwritten by 'dendextend':
## method from
## text.pvclust pvclust
## [1] "Pruning 1 of 12"
## [1] "Pruning 2 of 12"
## [1] "Pruning 3 of 12"
## [1] "Pruning 4 of 12"
## [1] "Pruning 5 of 12"
## [1] "Pruning 6 of 12"
## [1] "Pruning 7 of 12"
## [1] "Pruning 8 of 12"
## [1] "Pruning 9 of 12"
## [1] "Pruning 10 of 12"
## [1] "Pruning 11 of 12"
## [1] "Pruning 12 of 12"
# plot.hclust.sub(hclust.complete.dist.euclidean, k=12);
Choose either hclust.ward2.dist.euclidean or hclust.complete.dist.euclidean and review how the U.S. state capitals are clustered. [I commented out one form, so you will have to re-run if you want to select that one.]
Comment on the “face validity” of this approach based on your understanding about how the U.S. regions are defined? Are the North/South Dakotas together? What about the North/South Carolinas? What about the Pacific Northwest? While living in Kentucky, some people called the area “Kentuckiana” meaning Kentucky/Indiana. Does that show up? Also note anything that seems peculiar.
When looking at hclust.ward2.dist.euclidean, the state capitals are generally clustered with nearby states, but not necessarily by region. Because Alaska and Hawaii do not have any neighboring states, each is part of its own cluster. The clusters in the western part of the United States generally contain fewer states than the clusters in the Midwest and the eastern part of the country. Based on how I understand the U.S. regions to be defined, this approach did not cluster very accurately by region: many of the regions are missing states, or have other states added to them. The North and South Dakotas did cluster together, even at the lowest level. The North and South Carolinas ended up in the same cluster, but at the lowest level North Carolina is paired with Virginia, and South Carolina merges in later. This clustering tree split up the Pacific Northwest, with Washington and Oregon clustered with California and Nevada, and Idaho clustered with Montana and Utah. "Kentuckiana" does show up in this clustering method, as Kentucky and Indiana merge with each other first, before they merge with any other states. Something that seems peculiar to me is that North Dakota, South Dakota, Wyoming, and Colorado share a cluster, because there is quite a bit of distance between Colorado and North Dakota. It seems Colorado would fit better in a cluster with one of its other neighboring states.
hclust

I find hclust to be nice for an initial perusal of the data.
If you want to understand the stability of a particular hclust to use it for something other than an “initial perusal of the data,” I would recommend pvclust which I introduced in the weekly notebooks. It is often used in peer-reviewed research, as it provides a p-value of sorts regarding the stability of the hclust structure.
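As a sketch of that workflow (assuming the pvclust package is installed; the toy matrix and the nboot value here are my own illustration, not the course data):

```r
# pvclust bootstraps the hclust tree and attaches AU/BP p-values to each
# node, so you can see which merges are stable. Toy data, not the capitals.
if ( requireNamespace("pvclust", quietly=TRUE) )
	{
	library(pvclust);
	set.seed(1);
	toy = matrix( rnorm(30*5), nrow=30 );  # pvclust clusters the COLUMNS
	fit = pvclust(toy, method.hclust="ward.D2",
				method.dist="euclidean", nboot=100);
	# plot(fit); pvrect(fit, alpha=0.95);  # draw boxes around stable clusters
	}
```

With the real climate matrix you would pass the transposed data so the capitals (not the features) are the columns being clustered.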
Aggregating or agglomerating data is typically called “clustering”. In exploration, we can create some generic clustering techniques. Below we will create two ad hoc clustering rules.
Recall the “movie dataset” with “Will Smith” and “Denzel Washington”. We have a collection of movies, and how much money each movie made at the box office. We could organize those movies by some arbitrary rules. For example:
library(devtools);
source_url( paste0(path.mshaffer, "will") );
## SHA-1 hash of file is 3c1c97d961577e9b47a12faa20dca74bfe755c87
source_url( paste0(path.mshaffer, "denzel") );
## SHA-1 hash of file is 6692d8c5629db1ae023691bb55de2664a48cdde0
movies.50 = rbind(will$movies.50, denzel$movies.50);
unique(movies.50$ttid); # are they in any shared movies ???
## [1] "tt0480249" "tt1386697" "tt0116629" "tt0119654" "tt0343818" "tt0454921"
## [7] "tt0448157" "tt0120912" "tt1409024" "tt0386588" "tt0814314" "tt0112442"
## [13] "tt0172156" "tt0120660" "tt6139732" "tt0145660" "tt2381941" "tt1815862"
## [19] "tt1596350" "tt1229340" "tt5519340" "tt0307453" "tt1155076" "tt0120891"
## [25] "tt0120783" "tt1502397" "tt0248667" "tt4682786" "tt3322364" "tt1025100"
## [31] "tt0114558" "tt0300051" "tt0284490" "tt0146984" "tt1837709" "tt0338466"
## [37] "tt0947802" "tt1823664" "tt0268397" "tt1082886" "tt5814534" "tt3721964"
## [43] "tt0416212" "tt0108149" "tt0328099" "tt0167427" "tt0466839" "tt0107478"
## [49] "tt7255568" "tt0466856" "tt0765429" "tt0139654" "tt0454848" "tt0455944"
## [55] "tt0328107" "tt1907668" "tt0453467" "tt1037705" "tt0107818" "tt1599348"
## [61] "tt0210945" "tt1272878" "tt1111422" "tt0477080" "tt2404435" "tt0145681"
## [67] "tt3766354" "tt0251160" "tt0097441" "tt0368008" "tt0112740" "tt2671706"
## [73] "tt0174856" "tt0104797" "tt0107798" "tt0119099" "tt0133952" "tt0313443"
## [79] "tt0427309" "tt0115956" "tt0107616" "tt0124718" "tt0168786" "tt6000478"
## [85] "tt0114857" "tt0112857" "tt0102789" "tt0092804" "tt0100168" "tt0117372"
## [91] "tt0088146" "tt0097880" "tt0102456" "tt0099750" "tt0091786" "tt0082138"
## [97] "tt0097373" "tt4283892" "tt1698652" "tt0118783"
loadInflationData();
movies.50 = standardizeDollarsInDataFrame(movies.50,
2000,
"millions",
"year",
"millionsAdj2000");
movies.50$cluster.arbitrary = NA;
str(movies.50);
## 'data.frame': 100 obs. of 13 variables:
## $ rank : num 1 2 3 4 5 6 7 8 9 10 ...
## $ title : chr "I Am Legend" "Suicide Squad" "Independence Day" "Men in Black" ...
## $ ttid : chr "tt0480249" "tt1386697" "tt0116629" "tt0119654" ...
## $ year : num 2007 2016 1996 1997 2004 ...
## $ rated : chr "PG-13" "PG-13" "PG-13" "PG-13" ...
## $ minutes : num 101 123 145 98 115 117 92 88 106 118 ...
## $ genre : chr "Action, Adventure, Drama" "Action, Adventure, Fantasy" "Action, Adventure, Sci-Fi" "Action, Adventure, Comedy" ...
## $ ratings : num 7.2 6 7 7.3 7.1 8 6.4 6.2 6.8 6.6 ...
## $ metacritic : num 65 40 59 71 59 64 49 49 58 58 ...
## $ votes : num 675193 588111 520657 507618 491489 ...
## $ millions : num 256 325 306 251 145 ...
## $ millionsAdj2000 : num 213 233 336 269 132 ...
## $ cluster.arbitrary: logi NA NA NA NA NA NA ...
## you do something here ...
# (1) populate cluster.arbitrary
for (i in 1:100)
{
if (is.na(movies.50$millionsAdj2000[i]))
{
movies.50$cluster.arbitrary[i] = "Cluster 1"
}
else if ((movies.50$millionsAdj2000[i]) < 1)
{
movies.50$cluster.arbitrary[i] = "Cluster 2"
}
else if (movies.50$millionsAdj2000[i] >= 1 & movies.50$millionsAdj2000[i] < 5)
{
movies.50$cluster.arbitrary[i] = "Cluster 3"
}
else if (movies.50$millionsAdj2000[i] >= 5 & movies.50$millionsAdj2000[i] < 50)
{
movies.50$cluster.arbitrary[i] = "Cluster 4"
}
else if (movies.50$millionsAdj2000[i] >= 50 )
{
movies.50$cluster.arbitrary[i] = "Cluster 5"
}
}
movies.50$cluster.arbitrary;
## [1] "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5"
## [7] "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5"
## [13] "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 4" "Cluster 4"
## [19] "Cluster 4" "Cluster 5" "Cluster 1" "Cluster 5" "Cluster 5" "Cluster 5"
## [25] "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 4" "Cluster 4" "Cluster 4"
## [31] "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 2" "Cluster 4"
## [37] "Cluster 4" "Cluster 5" "Cluster 5" "Cluster 3" "Cluster 1" "Cluster 3"
## [43] "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 1" "Cluster 5"
## [49] "Cluster 2" "Cluster 4" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5"
## [55] "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5"
## [61] "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5"
## [67] "Cluster 5" "Cluster 5" "Cluster 4" "Cluster 5" "Cluster 5" "Cluster 4"
## [73] "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 4" "Cluster 4" "Cluster 4"
## [79] "Cluster 4" "Cluster 5" "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4"
## [85] "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 5"
## [91] "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4" "Cluster 4"
## [97] "Cluster 2" "Cluster 2" "Cluster 1" "Cluster 1"
# (2) summarize how many movies live in each (table count)
table(movies.50$cluster.arbitrary);
##
## Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
## 5 4 2 38 51
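As a side note, the row-by-row loop above can be collapsed into a single vectorized call. A sketch using base R's cut() with the same thresholds (the toy millions vector is illustrative, standing in for movies.50$millionsAdj2000):

```r
# cut() bins a numeric vector by breakpoints in one call; right=FALSE makes
# the intervals left-closed ([1,5), [5,50), ...), matching the loop's logic.
millions = c(NA, 0.5, 3, 20, 80);  # toy stand-in for movies.50$millionsAdj2000
cluster.arbitrary = as.character( cut(millions,
	breaks = c(-Inf, 1, 5, 50, Inf),
	labels = c("Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5"),
	right  = FALSE) );
cluster.arbitrary[ is.na(millions) ] = "Cluster 1";  # NA box office gets its own cluster
cluster.arbitrary;
## [1] "Cluster 1" "Cluster 2" "Cluster 3" "Cluster 4" "Cluster 5"
```

The loop is fine for an exam; the vectorized form just scales better and is harder to get wrong.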
John Tukey emphasized that ordering the data and then splitting it based on the ordering was a fundamental premise of exploratory data analysis.
John Tukey proposed five elements as primary data for analysis.
min, max: Sorting the data makes it easiest to find these two values, and the sorted order will also be useful for finding the other three exploratory summary features (Q1, the median, and Q3).
This is what I would call “slice and dice”. The data is cut in half, and the value of that middle “cutting point” is the Q2 which we call the median.
Next, the lower half could also be cut in half, and the value of that middle “cutting” point is Q1.
Then, the upper half could also be cut in half, and the value of that middle “cutting” point is Q3.
A common metric derived from this “median-split” procedure is called the interquartile range IQR which is defined as the distance between Q3 and Q1. It literally represents the middle 50% of the data; 50% of the elements of the dataset are between Q3 and Q1.
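A quick numeric check of the “slice and dice” idea, using base R (fivenum()'s hinges are close to, though not always identical to, quantile()'s Q1/Q3):

```r
# Tukey's five numbers and the IQR on a small sorted sample
x = c(2, 4, 6, 8, 10, 12, 14);
fivenum(x);       # min, lower hinge (~Q1), median, upper hinge (~Q3), max
## [1]  2  5  8 11 14
IQR(x, type=1);   # Q3 - Q1 under the "pick an actual data point" algorithm
## [1] 8
```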
x = 1:99;
length(x);
## [1] 99
median(x);
## [1] 50
x[50];
## [1] 50
median(x[1:49]);
## [1] 25
x[25];
## [1] 25
median(x[51:99]);
## [1] 75
x[75];
## [1] 75
# probability-approach ...
# algorithm: type=1 returns actual data points (the R default is type=7)
stats::quantile(x, prob = c(0.25, 0.5, 0.75), type=1);
## 25% 50% 75%
## 25 50 75
Algorithms address various issues associated with dividing numbers and whether or not to include the dividing number in the subset division, but the principle holds.
We can generalize this idea by not always cutting the data in half (median-split as n=2, median-median-split as n=4). Instead, we could cut by tens (we call them deciles). Or we could cut by hundreds (we call them centiles). The function quantile performs this operation, and if you dig into the doStatsSummary function used in this course, you can see its application.
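To make the algorithm difference concrete (the vectors here are illustrative):

```r
# type=1 (inverse empirical CDF) returns an actual data point;
# type=7 (the R default) linearly interpolates between points.
x = c(1, 2, 3, 4);
quantile(x, prob=0.25, type=1);  # 25% -> 1
quantile(x, prob=0.25, type=7);  # 25% -> 1.75
# deciles are just a different prob vector:
deciles = quantile(1:100, prob=seq(0.1, 0.9, by=0.1), type=1);
deciles;
## 10% 20% 30% 40% 50% 60% 70% 80% 90%
##  10  20  30  40  50  60  70  80  90
```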
movies.50$cluster.deciles = NA;
str(movies.50);
## 'data.frame': 100 obs. of 14 variables:
## $ rank : num 1 2 3 4 5 6 7 8 9 10 ...
## $ title : chr "I Am Legend" "Suicide Squad" "Independence Day" "Men in Black" ...
## $ ttid : chr "tt0480249" "tt1386697" "tt0116629" "tt0119654" ...
## $ year : num 2007 2016 1996 1997 2004 ...
## $ rated : chr "PG-13" "PG-13" "PG-13" "PG-13" ...
## $ minutes : num 101 123 145 98 115 117 92 88 106 118 ...
## $ genre : chr "Action, Adventure, Drama" "Action, Adventure, Fantasy" "Action, Adventure, Sci-Fi" "Action, Adventure, Comedy" ...
## $ ratings : num 7.2 6 7 7.3 7.1 8 6.4 6.2 6.8 6.6 ...
## $ metacritic : num 65 40 59 71 59 64 49 49 58 58 ...
## $ votes : num 675193 588111 520657 507618 491489 ...
## $ millions : num 256 325 306 251 145 ...
## $ millionsAdj2000 : num 213 233 336 269 132 ...
## $ cluster.arbitrary: chr "Cluster 5" "Cluster 5" "Cluster 5" "Cluster 5" ...
## $ cluster.deciles : logi NA NA NA NA NA NA ...
## you do something here ...
# use stats::quantile(x, prob=seq(0.1,0.9,by=0.1), type=1 );
# (1) how many NA's are there ... keep them NA's
sum(is.na(movies.50$millionsAdj2000))
## [1] 5
# (2) for the rest of the data, break it up into deciles
millionsAdj2000.withoutNAs = na.omit(movies.50$millionsAdj2000)
stats::quantile(millionsAdj2000.withoutNAs, prob=seq(0.1,0.9,by=0.1), type=1 );
## 10% 20% 30% 40% 50% 60% 70%
## 7.638768 20.177743 27.511542 37.259081 53.554796 64.786272 74.509802
## 80% 90%
## 115.650000 146.638920
deciles.vec = quantile(millionsAdj2000.withoutNAs, prob=seq(0.1,0.9,by=0.1), type=1 );
# (3) $cluster.deciles for a given movie should be NA, 1, 2, 3, ... 10
for (i in 1:100)
{
if (is.na(movies.50$millionsAdj2000[i]))
{
movies.50$cluster.deciles[i] = "NA"
}
else if ((movies.50$millionsAdj2000[i]) < deciles.vec[1])
{
movies.50$cluster.deciles[i] = 1
}
else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[1] & (movies.50$millionsAdj2000[i]) < deciles.vec[2])
{
movies.50$cluster.deciles[i] = 2
}
else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[2] & (movies.50$millionsAdj2000[i]) < deciles.vec[3])
{
movies.50$cluster.deciles[i] = 3
}
else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[3] & (movies.50$millionsAdj2000[i]) < deciles.vec[4])
{
movies.50$cluster.deciles[i] = 4
}
else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[4] & (movies.50$millionsAdj2000[i]) < deciles.vec[5])
{
movies.50$cluster.deciles[i] = 5
}
else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[5] & (movies.50$millionsAdj2000[i]) < deciles.vec[6])
{
movies.50$cluster.deciles[i] = 6
}
else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[6] & (movies.50$millionsAdj2000[i]) < deciles.vec[7])
{
movies.50$cluster.deciles[i] = 7
}
else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[7] & (movies.50$millionsAdj2000[i]) < deciles.vec[8])
{
movies.50$cluster.deciles[i] = 8
}
else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[8] & (movies.50$millionsAdj2000[i]) < deciles.vec[9])
{
movies.50$cluster.deciles[i] = 9
}
else if ((movies.50$millionsAdj2000[i]) >= deciles.vec[9])
{
movies.50$cluster.deciles[i] = 10
}
}
movies.50$cluster.deciles;
## [1] "10" "10" "10" "10" "9" "9" "10" "10" "9" "10" "6" "7" "9" "9" "10"
## [16] "10" "5" "5" "5" "8" "NA" "10" "9" "9" "7" "9" "6" "3" "3" "2"
## [31] "2" "3" "4" "4" "1" "4" "4" "6" "8" "1" "NA" "1" "4" "2" "4"
## [46] "4" "NA" "6" "1" "2" "8" "8" "8" "7" "7" "7" "6" "8" "8" "8"
## [61] "9" "6" "5" "6" "7" "7" "7" "7" "5" "6" "8" "5" "5" "6" "9"
## [76] "3" "5" "5" "3" "7" "3" "3" "3" "2" "3" "2" "4" "2" "3" "5"
## [91] "4" "1" "2" "1" "1" "2" "1" "1" "NA" "NA"
# (4) summarize how many movies live in each (table count)
table(movies.50$cluster.deciles);
##
## 1 10 2 3 4 5 6 7 8 9 NA
## 9 10 9 10 9 10 9 10 9 10 5
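The eleven-branch chain above can likewise be collapsed: findInterval() looks up which decile bin each value falls in with one vectorized call. A sketch on toy data (standing in for millionsAdj2000):

```r
# findInterval(v, breaks) counts how many breakpoints are <= v, i.e. 0 for
# values below the first decile; adding 1 yields bins 1..10. NAs stay NA.
v = c(5, 15, 25, 35, 45, 55, 65, 75, 85, 95, NA);  # toy stand-in
deciles.vec = quantile(v, prob=seq(0.1, 0.9, by=0.1), type=1, na.rm=TRUE);
bins = findInterval(v, deciles.vec) + 1;
bins;
```

Note that the "greater than or equal" convention matches the loop above: a value sitting exactly on a decile boundary lands in the upper bin.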
Rather than clustering on distance-linkage in a pair-wise fashion, we can cluster by randomly selecting just k points in our data and then identifying each remaining point's nearest anchor using some distance approach.
For example, if k=3, we would randomly select three of our data points. We would then compute the distances from all of the remaining points to these 3 anchor points. Each point is assigned to its closest anchor. At the end of this stage we have new groupings, so a new centroid is determined for each. At this point, the centroid https://en.wikipedia.org/wiki/Centroid is likely not one of our data points, but a location within the given cluster. In the “naive” approach, you merely take the average (mean) of all the members of the cluster. More advanced approaches (the default “Hartigan-Wong” of kmeans) utilize deviations from the average, a sum-of-squares approach. This is why in our kmeans-notebook we analyzed the wss to ascertain how many clusters k we should use.
Regardless, after the new centroids (centers) for the clusters are determined, the process iterates. All distances are computed from all points to the new centroids; points are assigned to a given centroid cluster (in this example: 1, 2, 3); a new centroid center is computed, and we repeat the process until a stopping rule is reached: maybe we have exhausted the number of iterations allowed (iter.max parameter of kmeans)? Or maybe we are not changing membership of any of the data points? Or maybe we have met some objective (like wss)?
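The loop just described can be sketched in a few lines of naive code (the toy 1-D data and k=2 are my own illustration; kmeans() does all of this, better):

```r
# One naive Lloyd-style loop: assign points to the nearest center, recompute
# centers as cluster means, stop when nothing moves.
pts = c(1, 2, 3, 10, 11, 12);   # two obvious groups
set.seed(42);
centers = sample(pts, 2);       # step 0: two random anchor points
for (iter in 1:10)
	{
	assign = apply( abs(outer(pts, centers, "-")), 1, which.min);  # nearest center
	new.centers = tapply(pts, assign, mean);                       # new centroids
	if ( all(new.centers == centers) ) { break; }                  # stopping rule: no movement
	centers = new.centers;
	}
sort( as.numeric(centers) );  # settles at the two group means for this data
## [1]  2 11
```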
It is possible to get a kmeans result by merely starting with 3 different random points. In general, kmeans is fast (and our computers are so much faster than the computers of 1990: the first computer I built in 1995 had 32MB of RAM, a Pentium processor https://en.wikipedia.org/wiki/Pentium#Pentium and a 900MB Cheetah hard-drive).
Anyway, we can utilize the parameter nstart to try many different starting values, and let the program identify the best, most consistent solution.
I would recommend the default algorithm Hartigan-Wong with iter.max=100 and nstart=100. You can test the timings against the default values: iter.max=10 and nstart=1. Since the data is likely multidimensional, stars are the best way to summarize the results and membership of kmeans. Please see the “kmeans-notebook” for examples.
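One way to see what nstart buys you (synthetic data, my own illustration; with the same seed, the first of the 100 starts reproduces the single-start fit, so the best-of-100 can never be worse):

```r
# nstart runs the whole algorithm from many random starts and keeps the
# fit with the lowest total within-cluster sum of squares.
set.seed(1);
X.demo = matrix( rnorm(500*4), ncol=4 );  # synthetic stand-in data
set.seed(42);
fit.1   = kmeans(X.demo, centers=5, iter.max=100, nstart=1);
set.seed(42);
fit.100 = kmeans(X.demo, centers=5, iter.max=100, nstart=100);
fit.100$tot.withinss <= fit.1$tot.withinss;  # TRUE: multi-start is never worse here
```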
For the 50 U.S. cities, we harvested climate data. In the “Wikipedia” notebook, details of that data provenance were outlined. If we want to compare the cities using the climate data, we have to build a matching dataframe, which means we have to select features that exist in each city’s climate data set.
There are four temperature features that are always present:
Additionally, for precipitation we have:
For each of these features we have 12 months of data. This is a nice dataset. Let’s see what we can do with it.
We should begin by doing a bit of peripheral research on the topic, to gain “domain knowledge” so we know what to do with the data. A few resources would benefit that effort:
https://www.climate.gov/news-features/featured-images/whats-coldest-day-year
When graphing data to visualize, it is essential that you keep the scales uniform so quick visual comparisons are accurate. I gave you a task to practice the idea of creating that “one informative” research graphic. I now present my version for you to use and critique based on your efforts. I am a plot guy, so some of you may have a different ggplot2-type solution. Ultimately, the intent of this graphic is to best summarize the data in a meaningful way for exploratory analysis.
climate = utils::read.csv( paste0(path.mshaffer, "_data_/state-capitals/final/state-capitals-climatedata.txt"), header=TRUE, quote="", sep="|");
plotTemperatureFromWikipediaData(climate, city.key="capital", city.val="Helena");
plotTemperatureFromWikipediaData(climate, city.key="capital", city.val="Baton Rouge");
I think that this graphic does a good job of displaying the climate information in a visual manner. I like that it is color coded based on how warm or cold a temperature is and I think that it is nice how it displays so many different variables on one plot. One thing that I might change about this plot is to make the lines and the points less chunky. I think that overall, this would help de-clutter the graphic. I also think that it might be helpful to add a separate bar to show snowfall, rather than having it on the same chart as the rainfall. Lastly, if there is a way to make the shading between the record and average temperatures more transparent, it might be helpful to be able to see the grid lines throughout the graphic. Overall, this graphic is aesthetically pleasing. The only thing that I think would make it better would be to make the lines and points thinner, as mentioned above. This might help with the overall organization of the graph, as it would seem less cluttered. I think that this graph is highly functional and does an excellent job portraying relevant information. Looking at this graphic and drawing conclusions would be much easier than looking through the values of a data frame or table.
[ What do you like about this graphic?
What do you dislike?
Is it aesthetically pleasing? Is it functional?
]
We can achieve a side-by-side comparison using the function described below. The first city will be graphed on the left, the second city on the right.
compareTwoCitiesClimates(climate, city.key="capital", city.1="Helena", city.2="Baton Rouge");
Similar to what I wrote above, I like how this graphic displays information from multiple different factors. The way the information is displayed makes it easy to draw conclusions about the data, and this graphic in particular makes it easy to compare the climate between two locations. One thing I don't like is that a different number of grid lines is shown on the y-axis of each graph. Although the y-axes have the same scale, it does not look that way at first glance; adding an equivalent number of grid lines to each graph, even if the points do not span them all, would solve this problem. The visible grid lines that appear on both graphs are the same; certain ones are simply unique to each graph (120 on Baton Rouge, and -20 and -40 on Helena). I also think it would be beneficial to make the shading transparent so the grid lines are visible throughout the graphic. The difference between rain and snow on this graphic is that rain is displayed as a point each month, whereas snow is displayed as a line graph shaded underneath. I don't necessarily think this is a bad approach, but if snow and rain had their own separate areas in the graphic it might be easier to read; when a region gets both snow and rain, it is a little hard to determine the actual snowfall.
[ What do you like about this graphic?
What do you dislike?
Are the y-axis the same scale? Are the visible gridlines for each the same?
What is the difference between rain and snow on the graphic? Was that a good approach? How would you have done it?
]
We are in exploration, so the “one” research graphic may be very different than the “one” formal graphic designed for a client. Typically, the final graphic needs to meet certain criteria:
To demonstrate the differences, I created a mockup of our 50 U.S. cities and put it into a nice finalized product form called “highcharts”.
http://md5.mshaffer.com/WSU_STATS419/_EXAMPLES_/fiddle_usmap/
It pulls data in real time (using AJAX) to grab the weather at the latitudes/longitudes we defined in our Wikipedia notebook.
Once you have a template built, it is rather easy to modify it. Here I changed the background map, and all of the data/features stay the same:
http://md5.mshaffer.com/WSU_STATS419/_EXAMPLES_/fiddle_usmap/world.html
1:12, or maybe just one month per season (April, July, October, January). If we want to cluster cities, which decisions seem best? Why? As you can see from the code below, you just comment out the options you are not using, and can quickly rerun the analysis.
climate = utils::read.csv( paste0(path.mshaffer, "_data_/state-capitals/final/state-capitals-climatedata.txt"), header=TRUE, quote="", sep="|");
##################### WHICH MONTHS #####################
########################################################
months = 1:12; # all the data
#months = c(1,4,7,10); # one month of each of the four seasons
########################################################
month.abb; # ?month.abb
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
month.name;
## [1] "January" "February" "March" "April" "May" "June"
## [7] "July" "August" "September" "October" "November" "December"
month.name[months]; # these are the names of the months you are selecting ...
## [1] "January" "February" "March" "April" "May" "June"
## [7] "July" "August" "September" "October" "November" "December"
# this function would allow us to use different months as criteria and different climate-data keys. It is variadic and flexible. `keys.n` are the names we will use for our new columns ...
climate.df = buildClimateDataFrame(climate, months, keys=c("Record high F (C)", "Average high F (C)", "Average low F (C)", "Record low F (C)", "Average precipitation inches (mm)", "Average snowfall inches (cm)"), keys.n = c("highmax", "highavg", "lowavg", "lowmin", "rain", "snow") );
climate.df;
names(climate.df); # this helps you see the indexes ...
## [1] "state" "capital" "st" "labels" "highmax.Jan"
## [6] "highmax.Feb" "highmax.Mar" "highmax.Apr" "highmax.May" "highmax.Jun"
## [11] "highmax.Jul" "highmax.Aug" "highmax.Sep" "highmax.Oct" "highmax.Nov"
## [16] "highmax.Dec" "highavg.Jan" "highavg.Feb" "highavg.Mar" "highavg.Apr"
## [21] "highavg.May" "highavg.Jun" "highavg.Jul" "highavg.Aug" "highavg.Sep"
## [26] "highavg.Oct" "highavg.Nov" "highavg.Dec" "lowavg.Jan" "lowavg.Feb"
## [31] "lowavg.Mar" "lowavg.Apr" "lowavg.May" "lowavg.Jun" "lowavg.Jul"
## [36] "lowavg.Aug" "lowavg.Sep" "lowavg.Oct" "lowavg.Nov" "lowavg.Dec"
## [41] "lowmin.Jan" "lowmin.Feb" "lowmin.Mar" "lowmin.Apr" "lowmin.May"
## [46] "lowmin.Jun" "lowmin.Jul" "lowmin.Aug" "lowmin.Sep" "lowmin.Oct"
## [51] "lowmin.Nov" "lowmin.Dec" "rain.Jan" "rain.Feb" "rain.Mar"
## [56] "rain.Apr" "rain.May" "rain.Jun" "rain.Jul" "rain.Aug"
## [61] "rain.Sep" "rain.Oct" "rain.Nov" "rain.Dec" "snow.Jan"
## [66] "snow.Feb" "snow.Mar" "snow.Apr" "snow.May" "snow.Jun"
## [71] "snow.Jul" "snow.Aug" "snow.Sep" "snow.Oct" "snow.Nov"
## [76] "snow.Dec"
##################### WHICH COLUMNS #####################
########################################################
#X = climate.df[,5:52]; # temperature
#X = climate.df[,5:76]; # everything (includes rain)
X = climate.df[,5:20]; # temperature, 1 month per season
#X = climate.df[,5:28]; # everything (includes rain), 1 month per season
########################################################
rownames(X) = climate.df$labels;
Xs = scale(X);
X = climate.df[,5:76]; # everything ... requires months = 1:12 above so the dataframe has all 76 columns
rownames(X) = climate.df$labels;
Xs = scale(X);
So let’s do some analysis with all of the data available to us. Most of the data is temperature, with values ranging from -42 degrees Fahrenheit (Helena, Montana) to 122 (Phoenix, Arizona).
The precipitation data (rain and snow) is measured in inches. So, should we scale the data? For PCA and orthogonal projections the answer is absolutely YES, but for hclust and kmeans, is that always the case?
You can make a choice below, and observe how it influences your answers.
whichX = X;
# whichX = Xs;
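A toy check of what scale() changes (my own illustration of why unit choice matters for Euclidean distance):

```r
# Temperatures span tens of degrees; rain spans a couple of inches. Unscaled,
# Euclidean distance is dominated by temperature. scale() centers each column
# and divides by its sd, putting the columns on equal footing.
temps = c(10, 40, 70, 100);       # wide spread (degrees F)
rain  = c(1.0, 1.5, 2.0, 2.5);    # narrow spread (inches)
M  = cbind(temps, rain);
Ms = scale(M);
round( apply(Ms, 2, sd), 8 );     # both columns now have sd = 1
round( colMeans(Ms), 8 );         # and mean 0
```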
k-means on All Climate Features

With the selected features, let's perform k-means. Let's select k=12. I will select all of the features, but you can change that if you wish.
colors = rainbow(50, s = 0.6, v = 0.75); # 50 colors for 50 states
# descriptive star plot to start
stars(whichX, len = 0.5, key.loc=c(12,2), draw.segments = TRUE);
## too busy, let's group them
x.start = 1;
x.end = 10;
for(i in 1:5)
{
stars( whichX[x.start:x.end,] ,
len = 0.5, key.loc=c(6,2), draw.segments = TRUE);
x.start = 1 + x.end;
x.end = 10 + x.end;
}
Above, you are just analyzing the general shapes. Which ones are “fuller” circles? Why?
Which ones are not very “full circles”? Why?
k = 12;
iterations = 100;
number.starts = 100;
whichX.kmeans = kmeans(whichX, k,
	iter.max = iterations,
	nstart = number.starts); # default algorithm: Hartigan-Wong
stars(whichX.kmeans$centers, len = 0.5, key.loc = c(10, 3),
main = "Algorithm: DEFAULT [Hartigan-Wong] \n Stars of KMEANS=12", draw.segments = TRUE);
# label BEFORE sorting, so each state stays attached to its own cluster number
membership = data.frame(Cluster = whichX.kmeans$cluster);
rownames(membership) = climate.df$labels;
membership = membership[ order(membership$Cluster), , drop=FALSE ];
membership;
print( table(membership) ) ;
## membership
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1 1 5 6 4 3 8 4 5 7 1 5
# I believe in an older version of R these were called $centroids
attributes = as.data.frame( whichX.kmeans$centers );
rownames(attributes) = paste0("Cluster.",1:12);
attributes;
Alabama is the only state that is a part of cluster 1. Alaska is the only state that is a part of cluster 2. Arizona, Arkansas, California, Colorado, and Connecticut are the five states that are a part of cluster 3. Delaware, Florida, Georgia, Hawaii, Idaho, and Illinois are the six states that are a part of cluster 4. Indiana, Iowa, Kansas, and Kentucky are the four states that are a part of cluster 5. Louisiana, Maine, and Maryland are the three states that are a part of cluster 6. Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, and Nevada are the eight states that are a part of cluster 7. New Hampshire, New Jersey, New Mexico, and New York are the four states that are a part of cluster 8. North Carolina, North Dakota, Ohio, Oklahoma, and Oregon are the five states that are a part of cluster 9. Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, and Utah are the seven states that are a part of cluster 10. Vermont is the only state that is a part of cluster 11. Virginia, Washington, West Virginia, Wisconsin, and Wyoming are the five states that are a part of cluster 12.
[ Summarize your k-means findings for 12 clusters.]
Correlation, like distance, is an important feature of multivariate analysis. So let’s review some basic correlation related to our climate data. For simplicity, let’s consider “Record High Temperature” and “Record Low Temperature” and see how they correlate with other factors we have gathered from Wikipedia.
Recall, in this table, “Jan-Dec” are different months of the same temperature variable.
library(Hmisc); # p-values for correlation
high = subsetDataFrame(climate, c("key", "units"), "==", c("Record high F (C)",1));
high = merge(high, capitals, by=c("capital","state"));
high.X = high[,c(5:18,21)]; # numeric data
high.cor = round( cor(high.X), digits=2);
# high.cor.p = rcorr(as.matrix(high.X), type="pearson"); # p-values for statistical significance ... # str(high.cor.p);
# examine July (idx = 7)
as.data.frame( high.cor ) ; # so it will render nicely in RStudio
high.cor.july = high.cor[,7];
high.cor.july;
## Jan Feb Mar Apr
## 0.25 0.45 0.53 0.65
## May Jun Jul Aug
## 0.85 0.90 1.00 0.91
## Sep Oct Nov Dec
## 0.86 0.73 0.46 0.27
## latitude longitude population.2019.est
## -0.15 0.02 0.31
plot(high.cor.july[1:12], ylab="Correlation", xlab="Month number", main="Correlation Trend by Month")
The plot above visually displays the correlation of each month's record high with July's record high. As expected, the cooler months (months further from July) have weaker correlation with July than the warmer months (months closer to July) do. The correlation for the record high for any given month is positively correlated with July (for all months of the year). This suggests that if the record high for July were to increase, the record high for every other month would also increase, and vice versa. We can also say that if the record high for July were to decrease, the record high for every other month would also decrease, and vice versa. January is correlated with July (0.25). This correlation is very weak because it is relatively close to 0. Overall, this suggests that the likelihood of the record high temperature of January increasing/decreasing as the record high temperature of July increases/decreases is relatively low. In other words, these two variables have a lower likelihood of having a relationship than two variables with a higher correlation. February is correlated with July (0.45). This correlation is weak because it is closer to 0 than it is to 1. Overall, this suggests that the likelihood of the record high temperature of February increasing/decreasing as the record high temperature of July increases/decreases is relatively low. In other words, these two variables have a lower likelihood of having a relationship than two variables with a higher correlation. March is correlated with July (0.53). This correlation is moderately strong because it is close to 0.5. Overall, this suggests that the likelihood of the record high temperature of March increasing/decreasing as the record high temperature of July increases/decreases is moderately high. In other words, these two variables are likely to have a relationship, but it may not be very strong. April is correlated with July (0.65). 
This correlation is moderately strong because it is relatively close to 0.5. It suggests that the record high temperature of April tends to increase or decrease as the record high temperature of July does, but the relationship may not be very strong. May (0.85), June (0.90), August (0.91), and September (0.86) are all strongly correlated with July because their values are close to 1; the record highs of these months are very likely to have a strong positive relationship with July's. As stated below, July perfectly correlates with July (1.00). This correlation is very strong and positive, which is expected because they are the same data.

October is correlated with July (0.73). This correlation is moderately strong: October's record high is likely to move with July's, but the relationship may not be very strong. November (0.46) and December (0.27) are only weakly correlated with July, since their values are closer to 0 than to 1; these months have a lower likelihood of tracking July's record high than the months with higher correlations. Longitude is correlated with the record high temperature in July (0.02). This correlation is extremely weak, as its value is almost 0, so I would not expect a relationship between longitude and the record high in July. Latitude is correlated with the record high temperature in July (-0.15). This correlation is very weak, as its value is relatively close to 0, but it is negative: if there is a relationship between the two variables, it is inverse, meaning the record high in July would decrease as latitude increases, and vice versa.

Population is correlated with the record high temperature in July (0.31). This correlation is weak because it is relatively close to 0, so population is unlikely to track July's record high closely. Intuitively, I would expect the months with more extreme weather, namely the winter months of November, December, January, and February, to correlate most with latitude for this data, and the data frame above confirms this. The data frame also shows that the summer months (June, July, and August) correlate least with latitude. This correlation always has the same sign: negative.
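A minimal sketch of the kind of comparison made above, using synthetic (made-up) data rather than the exam's climate data: build a small correlation matrix with `cor()`, then extract and rank one column the way the July column is read off the matrix. The variables `jun`, `jul`, and `dec` here are hypothetical stand-ins, constructed so one pair is strongly related and one weakly.

```r
# Synthetic sketch: extract one column of a correlation matrix and rank it.
set.seed(42);
n = 50;
jul = rnorm(n, mean = 100, sd = 5);   # pretend July record highs
jun = jul + rnorm(n, sd = 2);         # strongly related to jul by construction
dec = 0.3 * jul + rnorm(n, sd = 5);   # only weakly related to jul
X = data.frame(Jun = jun, Jul = jul, Dec = dec);
X.cor = round( cor(X), digits = 2 );
jul.cor = sort(X.cor[, "Jul"], decreasing = TRUE);  # Jul itself is 1.00, then Jun, then Dec
jul.cor;
```

Sorting the column makes the "which months track July most closely" reading immediate, instead of scanning the matrix by eye.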
Describe the correlation of July in “Record high F (C)” to the other numeric factors printed above.
x is correlated with y (0.00). This correlation is positive/negative which means … This correlation is strong/weak because … overall, this suggests …
July perfectly correlates with July (1.00). This correlation is positive and very strong. This is because they are the same data.
With the months, you can note each, or plot a trend showing them and discuss them briefly as a trend.
latitude is a measure of north/south, so be certain to apply the correlation value with some meaning. be certain you know which direction is positive or negative to correctly interpret the sign of the correlation.
longitude is a measure of east/west, so be certain …
population is the size of the city
intuitively, which months do you think correlate most with latitude for this data? which correlate the least? is the correlation always the same sign (positive/negative), or does it change? [You can use the dataframe output to do this analysis, or create your own subset]
library(Hmisc); # p-values for correlation
low = subsetDataFrame(climate, c("key", "units"), "==", c("Record low F (C)",1));
low = merge(low, capitals, by=c("capital","state"));
low.X = low[,c(5:18,21)]; # numeric data
low.cor = round( cor(low.X), digits=2);
# low.cor.p = rcorr(as.matrix(low.X), type="pearson"); # p-values for statistical significance ... # str(low.cor.p);
# examine Jan (idx = 1)
as.data.frame( low.cor ) ; # so it will render nicely in RStudio
low.cor.january = low.cor[,1];
low.cor.january;
## Jan Feb Mar Apr
## 1.00 0.95 0.93 0.92
## May Jun Jul Aug
## 0.89 0.80 0.74 0.77
## Sep Oct Nov Dec
## 0.90 0.87 0.91 0.97
## latitude longitude population.2019.est
## -0.72 -0.33 0.34
plot(low.cor.january[1:12], ylab="Correlation", xlab="Month number", main="Correlation Trend by Month")
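The commented-out `Hmisc::rcorr()` call above would attach a p-value to every pairwise correlation at once. For a single pair, base R's `cor.test()` does the same job. A minimal sketch on synthetic data (the variables here are made up, not the climate data):

```r
# Sketch (synthetic data): cor.test() reports the Pearson correlation for
# one pair of variables together with a p-value for H0: correlation = 0.
set.seed(1);
jan = rnorm(40);
feb = jan + rnorm(40, sd = 0.3);   # strongly related by construction
ct = cor.test(jan, feb, method = "pearson");
round(unname(ct$estimate), 2);     # sample correlation, close to 1
ct$p.value < 0.05;                 # TRUE: the correlation is statistically significant
```

A significance check like this is what separates "the correlation is 0.74" from "the correlation is 0.74 and unlikely to be a fluke of this sample."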
Describe the correlation of January in “Record low F (C)” to the other numeric factors printed above.
The plot above visually displays the correlation of each month's record low with January's record low. As expected, the cooler months (those closer to January) generally correlate more strongly with January than the warmer months do. It is important to note that the y-axis of this plot only goes down to 0.75; the lowest correlation among these months is 0.74, which is still relatively strong. Because of this, it is fair to say that the record low temperature of every month has a fairly strong positive relationship with the record low in January: if January's record low were to increase, every other month's record low would be expected to increase as well, and vice versa. January is perfectly correlated with January (1.00); this correlation is very strong and positive, which is expected because they are the same data. February (0.95), March (0.93), April (0.92), May (0.89), September (0.90), October (0.87), November (0.91), and December (0.97) are all very strongly correlated with January because their values are close to 1. The record lows of these months are very likely to have a strong positive relationship with the record low of January.

The more moderate correlations (though still relatively strong) with January's record low are June (0.80), July (0.74), and August (0.77). These months are still likely to follow the trend of January's record low, but less reliably than the months with higher correlations. Longitude is correlated with the record low temperature in January (-0.33). This correlation is relatively weak, as its value is closer to 0 than to -1, so I would expect at most a weak, inverse relationship between longitude and the record low in January. Latitude is correlated with the record low temperature in January (-0.72). This correlation is moderately strong, as its value is relatively close to -1, and because it is negative the relationship is inverse: as latitude increases, the record low in January decreases, and vice versa. These two variables have a relatively high likelihood of being related. Population is correlated with the record low temperature in January (0.34). This correlation is weak because it is relatively close to 0, so population is unlikely to track January's record low closely. A positive correlation does make sense here, though, because most people would rather live somewhere with a milder winter.
Similar to “high” writeup, but for the “low” data.
In the social sciences (e.g., Karl Weick), the concept of “sense making” refers to “the process by which people give meaning to their collective experiences”. I have used this framework in my high-technology innovation research (See Figure 1 of http://www.mshaffer.com/arizona/pdf/LoneGenius.pdf, my rubric concept comes from learning-theory growth models: Nascent, Adolescent, Mature.)
This final topic is reflective: we are thinking about how we think.
The syllabus defined statistics as “the discipline that concerns the collection, organization, analysis, interpretation and presentation of data.” (See https://en.wikipedia.org/wiki/Statistics)
There are 5 elements mentioned: collection, organization, analysis, interpretation, and presentation of “data”. Are those equally weighted? That is, should we devote 20% of our time to each of those? Now, consider the “analysis” stage. I have suggested there are two camps: exploratory and confirmatory data analysis. Are those equally weighted? That is, should we devote 50% of our time to each of those? Now, in an “equally-likely” scenario, we would have:
x = c(20,20,10,10,20,20);
x.labels = c("collection", "organization", "EDA analysis", "CDA analysis", "interpretation", "presentation");
x.colors = c("blue", "lightgreen", "green", "darkgreen", "orange", "red");
barplot(x,
col = x.colors,
ylim = c(0, 20),
ylab = "Proportion: (Sums to 100)",
main="Statistics as the Study of 'Data'");
text(1.14* (1:6), par("usr")[3], col = x.colors, labels = x.labels, srt = 45, adj = c(1.1,1.1), xpd = TRUE, cex=.75);
“Data analytics is the application of powerful new methods—drawn from computer science, mathematics and statistics, and domain sciences—to collect, curate, analyze, discover and communicate knowledge from ‘big data’.” https://data-analytics.wsu.edu/ (Accessed October 2020)
I love data.
I also love math/physics. I also love exploratory data analysis. I also love computational statistics or statistical computing https://en.wikipedia.org/wiki/Computational_statistics. I also love thinking about developing the one graphic to summarize data most effectively.
The idea of sharing in the learning process is an important aspect of the apprenticeship model. You are learning a trade (data analytics). I have experience in this trade. My job as the instructor is to provide you with a variety of “situated-learning” experiences to help you understand the nature of the trade. This exam is an example of such an experience.
Below are the core requirements for the data analytics program:
These are not the tools of the trade, but hopefully, they introduce you to key tools of the trade. What exactly are tools of the trade? [You will have an opportunity to write a response below.]
This video was recently shared with me that highlights some distinctions among persons practicing various forms of data analysis https://www.youtube.com/watch?v=uHGlCi9jOWY. As an orthogonal projection, I would create two axes. On the horizontal axis (x-axis), I would place “theory of data” to the left and “application of data” toward the right. On the vertical axis (y-axis), I would place “care for data integrity” at the top and “less care for data integrity” at the bottom.
As someone who comes from industry, having hired young people like you out of Computer Science and Electrical-Computer Engineering, I have opinions related to skills of the trade.
[This is worth 20 points.
Specifically, address:
a barplot of your opinion within your response would seem appropriate
x = c(10,10,20,15,25,20); ## change these values and discuss ...
x.labels = c("collection", "organization", "EDA analysis", "CDA analysis", "interpretation", "presentation");
x.colors = c("blue", "lightgreen", "green", "darkgreen", "orange", "red");
barplot(x,
col = x.colors,
ylim = c(0, max(x)),
ylab = "Proportion: (Sums to 100)",
main="Statistics as the Study of 'Data'");
text(1.14* (1:6), par("usr")[3], col = x.colors, labels = x.labels, srt = 45, adj = c(1.1,1.1), xpd = TRUE, cex=.75);
what tools of the trade should you be acquiring from the core courses? how are you doing in that acquisition process (e.g., tool X is … and right now I feel like my understanding/proficiency of tool X … ) …
utilize the provided plot script to place the core-course categories on the proposed x-y graph related to analytics practice (Applied vs Theoretical) and care of data integrity (Great Care vs Little Care) … Also place your personal assessment on the plot script provided
evaluate your skill-level on the six “skills of the trade”: Emerging (Nascent), Developing (Adolescent), Mastering (Mature). explain your evaluation and include other important skills you believe are relevant that are not included
Any other comments you would like to share.
]
(1) I think that the largest proportion of "statistics" (25%) should be devoted to interpretation of the data, because without proper interpretation the field of statistics loses its purpose. Along the same lines, I think that presentation of the data should be 20% of "statistics". The way that data, and the conclusions drawn from it, are presented is important: if others cannot understand what conclusions were drawn, or why, those conclusions become almost useless. As a data analyst, a very important part of the job is being able to effectively communicate results. Also deserving 20% is EDA analysis; this branch of statistics is very important for enabling new predictions and discoveries about all aspects of life. Next, I placed CDA analysis at 15%. It is important to consistently challenge the assumptions made about data and to confirm that the conclusions drawn are still valid. Coming in at 10% each are collection and organization of data. Although these are important steps in the process, they also yield data sets that can be reused across multiple different analyses.
(2) The tools of the trade that I think should be acquired from the core courses include:
- the ability to think critically and solve problems
- the ability to apply what is learned in the degree program to the "real" world
- how to use programming languages to help analyze data.
Right now, I think that my understanding / proficiency of all three of these "tools of the trade" would be ranked at an intermediate level. I think that this is reasonable, considering I am taking a wide variety of courses that introduce me to and help develop skills in many different aspects of data analytics. I don't imagine I will consider myself "proficient" or a master in any of these things until I am in a certain niche of the job market where I will have the opportunity to further develop my skills.
(3) see plot below...
(4) Of the six "skills of the trade", I would rate my skill-level as emerging for each of them. Although I am getting my degree in Data Analytics, it doesn't seem like I have had much experience working with data thus far. Many of the classes that I have taken seem to have fallen more on the "theoretical" side of the spectrum. I think that this course has helped me (or I'm anticipating it will in the future based on where the course is going) to develop some, if not all, of these skills, in particular tracking and documenting the analyses I perform, experimenting with different visualizations in search of an optimal "one graphic" result, and communicating and defending my results.
(5) Stat 419 has been the most challenging class that I have encountered in my degree program. While this can be frustrating at times, I appreciate that it is not only preparing me to be a better candidate for the job market, but it also allows me to exercise my problem solving skills, which will transfer over into other courses and projects in the future.
# x is -1 for perfectly theoretical
# x is 1 for perfectly applied
# y is -1 for no care whatsoever for data integrity
# y is 1 is perfect care for data integrity
########################### basic plot setup #####
plot(0,0, col="white",
ylim=c(-1.5,1.5), xlim=c(-1.5,1.5),
xlab = "",
ylab = "",
xaxt = 'n', bty = 'n', yaxt = 'n',
main = "Axiomatic Perspective on Practice/Care",
);
segments(-1,0,1,0, col="#999999");
segments(0,-1,0,1, col="#999999");
text(-1.1,0, "Theoretical Data Practice", cex=0.5, srt = 90);
text(1.1,0, "Applied Data Practice", cex=0.5, srt = -90);
text(0,1.1, "Great Care for Data Integrity", cex=0.5, srt = 0);
text(0,-1.1, "Little Care for Data Integrity", cex=0.5, srt = 0);
########################### basic plot setup #####
########################### you can add elements here #####
## this point represents the professor's self-perception
# points(0.75, 0.95, pch=20, col="blue");
# text(0.75, 0.95, "Shaffer (self)", col="blue", cex=0.75, srt = 45, pos=2);
############# TODO ###### ... maybe change color for each data point
# https://brand.wsu.edu/visual/colors/
# crimson = #981e32
# you need to change the x,y from 0,0 ...
# you can change col ... cex (font size), srt (angle), and pos = 1,2,3,4
#
points(0.25, 0.7, pch=20, col="#981e32");
text(0.25, 0.7, "Jailee (self)", col="#981e32", cex=0.75, srt = 45, pos=3);
#
# ## evaluate the Course Categories of Tools of the Trade
# ## give them a score
#
points(-0.9, 0, pch=20, col="#ed3edc");
text(-0.9, 0, "Math(s)", col="#ed3edc", cex=0.5, srt = 45, pos=3);
#
#
points(0.3, 0.2, pch=20, col="#613af0");
text(0.3, 0.2, "Computer Science", col="#613af0", cex=0.5, srt = 45, pos=3);
#
points(0.6, 0.7, pch=20, col="#3abff0");
text(0.6, 0.7, "Machine learning", col="#3abff0", cex=0.5, srt = 45, pos=3);
#
points(-0.8, 0.7, pch=20, col="#0078cf");
text(-0.8, 0.7, "Statistics", col="#0078cf", cex=0.5, srt = 45, pos=3);
#
points(0.7, 0.8, pch=20, col="#1ac489");
text(0.7, 0.8, "Data analytics", col="#1ac489", cex=0.5, srt = 45, pos=1);
#
# # You have a track (e.g., Business)
points(-0.4, 0.3, pch=20, col="#f09116");
text(-0.4, 0.3, "Core discipline - Actuarial Science", col="#f09116", cex=0.5, srt = 45, pos=3);